The underlying factors affecting users’ choices of what to watch on TV have for several years been of interest to commercial and academic research. In the midst of a rapidly changing device and multimedia landscape, TVs continue to be at the core of multimedia consumption in the home with scenarios covering, among others, social gatherings and solitary immersive moments. The inherent complexity of viewing situations challenges the creation of experiences that match personal preferences as well as temporal and social contexts.
Due to the increased availability of multimedia, research has focused on improving users’ decision process by reducing large catalogs of content to a few personalized suggestions. Commercial recommender solutions are now considered core to the business of engaging users and thereby preventing abandonment. To do so, recommender systems have explored various features for personalization, such as viewing history, ratings, user/item similarity, and time of day, the last of which is an example of the features characteristic of context-aware recommender systems (CARS). The main objective of a recommender system is to personalize the experience to the individual, often by studying the user-item matrix. This can be an issue, since an account on a TV is often shared by multiple members of a household, who end up diluting the user profile. A solution is to allow sub-accounts, but this does not solve the problem when a group of users wants a shared experience, which is the focus of the research field of group recommender systems. Hence, to achieve recommendations aimed at dynamic compositions of users, the system needs to be aware of the social context in which it is used, and thus more features than those classically used for personal predictions are needed. Recent studies even suggest decoupling the goal of tailoring the experience to the individual (personalize) from tailoring it to the situation and intent (contextualize), thereby focusing more on the immediate context than on the past behavior of a user. In this study, we investigate some of the advantages and disadvantages of contextless personalization as well as contextualized suggestions. We focus less on algorithmic improvements within each approach, e.g. optimization of contextless personalization, and more on the contribution of different contextual dimensions.
Even though the concept of context-aware recommendations has been studied in several academic and commercial projects, there is still a need for publicly available datasets, since only a limited number of such datasets exist. Furthermore, the majority of existing CARS datasets are based on explicit feedback, often in the form of ratings, e.g. for movies. However, within TV recommender systems the feedback is usually implicit in the form of watched/not watched, since continuously probing users for explicit feedback would significantly alter the viewing experience. The use of implicit feedback also means that a user can provide feedback for the same content multiple times, which is typically not the case for e.g. movie ratings.
Another challenge of using existing datasets for developing CARS within the TV domain is that by far the most dominant contextual feature is the timestamp, i.e. temporal information. Though TV viewing is highly driven by habits linked to temporal context such as time of day, day of week, and season, CARS based exclusively on temporal information could miss non-trivial correlations between e.g. temporal and social contexts. It is, however, challenging to collect TV consumption data that includes contextual information beyond timestamps. People meters (devices used to log users’ TV viewing behavior, which for identification often rely on participants pushing a button on a remote when they enter and leave), for instance, are challenged, firstly, by non-compliance (participants neglect to push a button), and secondly, by exposure uncertainty: since meters log the opportunity to consume some content, there is no information about the actual exposure, i.e. the TV could be showing content that the user does not watch.
In this study, we collect and analyze self-reported interactions for a limited number of content classes (genres). We include several contextual features, which allow inspection of patterns in the consumption of different genres. In addition to the comprehensively studied temporal information, our contribution includes novel investigations of associations between consumed content and social settings, e.g. who is present. Using the Experience-Sampling Method (ESM), we ask participants to report TV consumption multiple times each day for a five-week period. Through self-reported data, we decrease uncertainty of exposure to content, and allow collection of non-trivial information, such as how much attention is paid to the TV. The data is structured to accommodate quantitative analyses, e.g. in the CARS community, and is publicly available under the name Contextual TV (CTV) dataset (available at http://kom.aau.dk/~zt/online/ContextualTVDataset). Note that the self-reported information provided by participants could potentially be collected implicitly at run-time in a real-world implementation.
Using different feature configurations, we also show how well-established methods perform in predicting consumed content given contextual settings, and compare this with contextless prediction. In an initial study, we showed the effectiveness of including contextual information, which we expand in the present contribution with in-depth data analysis and detailed investigation of prediction performance. That is, we assess gains of adding contextual knowledge to the prediction task, and study contributions from each contextual dimension to the overall ability to predict which genre a user will engage with in a given situation.
The rest of the paper is organized as follows. We start by surveying related work in Section II. In Section III we introduce and analyze the contextual TV dataset. Section IV presents methods for predicting content and evaluates different configurations. Finally, Sections V and VI discuss the findings and conclude the study.
II Related Work
II-A Contextual Aspects of Watching TV
Previous studies of users’ TV watching behavior in given contexts have shown that the TV is mostly a social platform and that consumption takes place in a wide variety of situations. In one study, 30 households were scan-sampled every 10 minutes for four days to reveal patterns of who was watching, when, and with whom. Notably, it was found that family members were watching TV together 64% of the time. Another study presents the results of an online survey with 550 valid responses. Its results confirm that contextual settings, and not only the users’ personal profiles, are of importance to the decision of what content to watch. A multi-method field study conducted with 11 participants highlights both temporal and social settings as key contextual indicators of consumed content. An example of collecting qualitative contextual TV consumption data using diaries has been presented for 12 households over a three-week period. The main difference between the studies listed above and our work is that we aim for a quantitative dataset that avoids the recall bias associated with questionnaires. To this end, ESM has proven useful for obtaining frequency and patterning of daily activities and social interactions. A recent study combined automated data logging from the TV with event-triggered ESM to show, among other things, how social context affects TV volume. That is, a number of sensors were used to automatically extract contextual settings of TV viewing events, e.g. Bluetooth trackers to identify present users and their activity level, together with chatbot sessions for obtaining self-reported information about e.g. social context.
II-B Recommending Based on Context
The task of recommending content to users based on their past behavior as well as context is an active research field. Early work focused on pre- and post-filtering, while recent studies have included contextual information directly in the model. The main approaches are tensor factorization, factorization machines [18, 19], and, most recently, efforts based on deep learning. In one such study, an Android application is used to collect smartphone sensor-based contextual point-of-interest data from 90 students for a month-long period, and recommendations are based on deep auto-encoding. As in the present work, ESM is used for collecting feedback from participants.
|[22, 23, 24, 25]||✓|
|[28, 29, 30]||✓||✓|
Within TV content recommendation, several studies have based recommendations on context. Table I summarizes related works and shows which contextual features are included in each study. The table also lists the availability of data from each contribution (to the best of the authors’ knowledge). An early example that includes contextual information collected people meter data from approximately 60 participants for one year. The system has three main components: explicit user modeling based on explicit feedback from users; stereotypical user modeling based on age, gender, etc.; and dynamic user modeling based on implicit information inferred from users’ viewing behavior in certain temporal contexts. More recent works have similarly studied temporal aspects of recommending TV content. In another study, approximately 100 participants provided diary data for two weeks. In addition to temporal information, the collected viewing contexts include three mood selections (happy, bored, and unhappy). Feedforward neural networks are used for recommending content, and the results suggest that users’ emotional state helps improve performance. A further study presents recommendations using support vector machines (SVM) for people meter data collected from 20 families in Finland during a five-month period. The viewing contexts consist of temporal and social (additional viewers) information. The results suggest that including social context yields a minor improvement, but that the improvement depends on family habits, i.e. the correlation between temporal and social settings in families’ typical viewing behavior. Other work also includes social context, using RFID tags to identify users, and evaluates the system in a real-world implementation. Elsewhere, users’ moods are used to improve navigation of programs available in the electronic program guide (EPG). Another contribution presents a people meter dataset containing implicit viewing events with timestamps for a four-month duration; temporal context is smoothed, and the distance between contextual settings is used to recommend TV programs. A comparable dataset additionally includes familiar context, that is, the additional users watching. The corresponding results suggest that temporal context cancels the effect of social context when both are used to recommend TV content. Finally, more than 700 million views collected on a Tunisian TV platform have been used to recommend TV content in a given context, with viewing contexts defined by location, time/day, weather and occasion.
As evident from the literature listed above and in Table I, context-aware TV recommendations have primarily revolved around quantitative data collected through (people) meters. Though meters have many advantages, such as enabling easy large-scale collection of implicit feedback, they do (as previously stated) suffer from e.g. non-compliance and uncertainty about actual exposure. To better embrace the complexity of viewing situations, we ask participants to provide information that is not easily accessed through meters, such as how much attention a user is paying in a given viewing situation. Also, instead of relying on participants to continuously register their presence in front of the TV using a remote control, our adopted data collection method is chosen to reduce noisy measurements of social settings. Another observation from the literature is that studies have difficulties comparing their findings to those of other works, partially because there is no tradition of sharing results on common datasets, as there is within e.g. movie recommendation. Thus, to the best of the authors’ knowledge, the present contribution is the first to publicly share a dataset with TV viewing events that include contextual settings beyond timestamps.
Prior work has observed that users tend to choose content such that a few dominant items or providers account for the majority of consumption. This phenomenon is referred to as contextual bias, and it has been discussed how it decreases the diversity of recommenders. In this paper, we include scores that enable assessment of diversity in predictions.
III Contextual TV Watching Dataset
This section details the procedure for collecting the dataset, and highlights a number of patterns within the data. The quantitative analysis is focused on general contextual tendencies of viewing situations as well as considerations of temporal and social context dynamics.
III-A Experimental Protocol
|Q1:||Have you watched TV within the last four hours?||Yes, no|
|Q2:||Who were you watching it with?||Multiple-option: Alone, partner, child (0-12), child (12+), sibling, parent, friend, other (text)|
|Q3:||How many people (including yourself) watched TV?||1, 2, 3, 4, 5+|
|Q4:||What did you watch?||Multiple-option: News, sport, movie, series, music, documentary, entertainment, children’s, user-generated, other (text)|
|Q5:||Which service(s) did you use?||Multiple-option: Traditional TV, DRTV, TV2 Play, Viaplay, Netflix, HBO Nordic, YouTube, other (text)|
|Q6:||How much attention did you pay to the TV?||None-full (5 steps)|
To obtain data from participants we developed a web page, thereby allowing access from any device with Internet access and a web browser, though we recommended the use of mobile devices. Participants were asked to answer questions five times every day at 8, 12, 17, 20, and 22 (or when going to bed) for a period of 36 days. These intervals were chosen to accommodate work and study schedules, while still providing ample opportunity to participate throughout the day. Participants were allowed to answer more frequently and at other times than the five pre-specified intervals. Scheduled reporting was preferred to event-based reporting, e.g. asking participants to answer after every TV watching session, since it places less of a burden on participants to remember to report and enables evaluation of compliance. Also, scheduled reporting allows reminders about the study to be sent to participants. Specifically, we used a public calendar with alerts for iOS devices and web push notifications for all other types of devices. Prior to launch, a three-day pilot test was conducted in which 12 potential participants evaluated the web page and reminders. Participants for the main study were recruited through social media, with a lottery of three loudspeakers worth €170 each as an incentive at the end of the study. A requirement for joining the lottery was at least 14 days of active participation.
The first time a user visits the web page that person is instructed to answer background information questions as part of the enrollment procedure. The collected information includes: Gender, age group, language (Danish/English), device type, household size, additional household members, frequency of TV watching, and favorite TV genres.
On subsequent logins, participants were asked the questions listed in Table II. The questions are designed to have a low cognitive load and take less than 30 seconds to answer. The general flow is that Q2-Q6 are asked only if the selection for Q1 is yes. Also, Q3 is skipped if alone is selected for Q2. For Q5 all except Traditional TV (and possibly other) are streaming services, some specific to Denmark/Scandinavia. The multiple-option questions allow more than one selection, e.g. partner and friend. Participants are instructed to split answers with different contextual settings, e.g. watching news alone and children’s TV with a child. Answers are logged with the following format: Answer ID, User ID, timestamp, Q1, Q2, Q3, Q4, Q5, Q6. In this study we extract two pieces of information from the timestamp. One is the day of the week that can be used to determine whether it is weekend. Also, we group public holidays and weekend unless stated otherwise. The other feature is the time of the day. We use five groups: 1) Morning: 6-10; 2) Noon: 10-14; 3) Afternoon: 14-18; 4) Evening: 18-22; 5) Night: 22-6.
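The two timestamp-derived features described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function names are hypothetical, and treating each group as a half-open hour interval (start included, end excluded) is an assumption, since the text does not state how boundary hours are assigned.

```python
from datetime import date, datetime

# Time-of-day groups from the text, interpreted as [start hour, end hour).
# Night (22-6) wraps around midnight and is handled as the fallback.
TIME_GROUPS = [
    ("morning", 6, 10),
    ("noon", 10, 14),
    ("afternoon", 14, 18),
    ("evening", 18, 22),
]


def time_of_day(ts: datetime) -> str:
    """Map a timestamp to one of the five time-of-day groups."""
    for name, start, end in TIME_GROUPS:
        if start <= ts.hour < end:
            return name
    return "night"


def is_weekend(ts: datetime, holidays=frozenset()) -> bool:
    """Saturday/Sunday, or a public holiday (grouped with weekends)."""
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    return ts.weekday() >= 5 or ts.date() in holidays


if __name__ == "__main__":
    ts = datetime(2024, 1, 6, 21, 30)  # a Saturday evening
    print(time_of_day(ts), is_weekend(ts))
```

The `holidays` argument is a plain set of `date` objects supplied by the caller; which public holidays apply is not specified in the text.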
III-B Data Analysis
III-B1 Participants’ Background
A total of 118 participants (64 male and 54 female) in the age range 13-70 took part in the ESM study. 57% of the participants were in the age range 21-30, and 84% lived in households with at least two members including themselves. At the enrollment, 81% reported that they watched TV daily and 97% that they watched at least once a week. Concerning reported favorite genres, the series genre attracts the largest audience (89). Movie (75) and documentary (73) are also popular, while entertainment (63) and news (61) follow as fourth and fifth, respectively. Sport (44) is sixth, user-generated (21) seventh, and music (17) and children’s (17) share the last place.
Fig. 1 shows the development in enrolled and active (at least one answer that day) participants for each of the 36 days in the study. Notice the relatively large drop-off between enrolled and active participants within the first five days of the study, mainly caused by one-time visitors (a total of 31 throughout the study). From day seven onwards, the number of active participants decreased on average by three every fourth day. The average number of active participants per day was 53, and a total of 60 participants met the requirement of at least 14 active days. Saturdays (day 5, 12, 19, etc. of Fig. 1) tended to have fewer active participants, and Saturday was also the day with the fewest responses (935), compared to the most on Wednesdays (1115). Notably, it seems that some of the inactive users returned on Sundays after a "day off". In terms of time of day, participants answered most frequently in the evening (1784) and least frequently in the morning (1203), with the remaining groups ranked as follows: night (1468), noon (1418), and afternoon (1328).
III-B3 TV Consumption
The dataset consists of 6443 answers. Each answer including more than one selected option for Q4 is split (with the same values for the other entries), which brings the total number of answers to 7201. Of these, 3090 are answers with yes for Q1. Fig. 2 shows the distribution of answers for Q1 and Q4. It is worth noticing that series, as the most frequent selection for Q4, accounts for approximately 25% of the answers. This is in accordance with the reported TV favorites in the background information of the participants, where 75% of the participants reported series as a favorite. When comparing the reported TV content favorites with Fig. 2, two genres stand out in particular, namely movie and documentary. These are close competitors for second place among reported TV favorites, but in answers for Q4 they have relatively low counts. This may not come as a surprise, since they typically have a longer duration and might require more attention than the other genres, and thus may not be consumed as frequently. In addition to series being the most watched genre, participants on average report being slightly more attentive when watching series compared to both movie and documentary, as shown in the top of Fig. 3. The reason for the low counts of movie and documentary needs more analysis, and it would be interesting to study whether it is because users struggle to find and select content within these two genres in particular.
Another point to highlight in Fig. 3 is that music and children’s have low average attention levels compared to the other genres. In the case of children’s, this is most likely because the TV is used primarily by children of the respondents. Also, as seen from the bottom of Fig. 3, children’s appears to be mainly a social genre. Notice the relatively large standard deviation within the genres sport and music, indicating that these are consumed in different social settings, sometimes by one user and at other times by groups of users.
Pearson’s chi-square test is used to measure the level of association between the choice of genre (removing answers with the selection other) and the contextual features. Also, Cramér’s $V$ is reported. To this end, a contingency table is formed for each contextual dimension, and results are presented for all cases where at least 80% of the expected frequencies are above five and none are zero. A significant interaction is found between genre and time of day ($\chi^2(32)=326.39$, $p<0.001$, $V=0.16$). The same applies between genre and weekday/[weekend/holiday] ($\chi^2(8)=125.52$, $p<0.001$, $V=0.20$), additional viewers ($\chi^2(56)=1192.53$, $p<0.001$, $V=0.21$), number of viewers ($\chi^2(32)=540.47$, $p<0.001$, $V=0.21$), attention level ($\chi^2(32)=593.36$, $p<0.001$, $V=0.22$), and service ($\chi^2(56)=2169.12$, $p<0.001$, $V=0.29$).
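For readers who want to reproduce this kind of association analysis, the following is a minimal sketch in plain Python of Pearson’s chi-square statistic, its degrees of freedom, Cramér’s V, and the expected-frequency condition mentioned above. The function names and the toy contingency table are illustrative assumptions; the p-value is omitted because it requires the chi-square distribution (available, for example, through `scipy.stats.chi2_contingency`). As a plausibility check on the degrees of freedom, nine genres (excluding other) crossed with five time-of-day groups give (9-1)(5-1) = 32, matching the value reported for time of day.

```python
from math import sqrt


def expected_frequencies(table):
    """Expected counts under independence for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_cramers_v(table):
    """Return (chi2, degrees of freedom, Cramér's V)."""
    expected = expected_frequencies(table)
    n = sum(sum(row) for row in table)
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    rows, cols = len(table), len(table[0])
    dof = (rows - 1) * (cols - 1)
    v = sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    return chi2, dof, v


def expected_counts_ok(table):
    """At least 80% of expected frequencies above five and none zero."""
    flat = [e for row in expected_frequencies(table) for e in row]
    return all(e > 0 for e in flat) and sum(e > 5 for e in flat) >= 0.8 * len(flat)


if __name__ == "__main__":
    toy = [[30, 10], [10, 30]]  # hypothetical counts, not from the dataset
    print(chi_square_cramers_v(toy), expected_counts_ok(toy))
```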
III-B4 Temporal and Social Aspects of TV Consumption
We have shown that the choice of genre is associated with several contextual settings. However, questions, such as what changes during the day, in the weekend, or in social situations, remain unanswered, and are hence investigated below.
A fundamental difference is the consumption pattern. As shown in Fig. 4, TV watching in social settings (with at least one co-viewer) happens most frequently during the evening or night, both on workdays (75%) and weekends (70%). Morning, noon, and afternoon together account for the remaining 25% of social observations on weekdays and 30% on weekends. TV watching in solitary settings is more spread throughout the day, though weekdays show the same tendency, with evening and night being the dominant time slots. Independent of social context, TV watching in the morning occurs most frequently during weekdays, while the share of noon and afternoon viewing increases on weekends. Also note from the figure that approximately 57% of all observations (3090) take place in a social context (1752), while 63% are during workdays (1956).
Another element to consider is the attention level of the users, as presented in Fig. 5. Generally, users pay more attention to the content as the day progresses. A notable exception is social afternoons on weekdays, which have the lowest average attention level among all social settings, possibly because users have just returned home from work and are engaging in conversations and other activities while watching TV. There is also a tendency for users to pay more attention when they are alone. Exceptions are mornings and nights, where the levels are approximately equal. Interestingly, when compared to watching alone, it seems that the way users co-view TV changes between afternoon, evening, and night, such that the social activity becomes more focused on watching TV intensively as it gets late.
Lastly, Fig. 6 shows how the temporal and social context influence the choice of genre. Note that the height of the bars indicates the share of a genre within one of the four contextual settings, alone+weekday, alone+weekend, social+weekday, and social+weekend. Hence, it cannot, for example, be concluded that movie is watched more on weekends than on weekdays in social settings (actually, the opposite is the case). It can, however, be concluded that the proportion of times movie is selected over other genres is higher on weekends. A number of genres show clear differences between weekdays and weekends. News, series, and entertainment are preferred during weekdays, while sport and movie increase their share considerably during weekends. Though series are watched less on weekends, it is the genre with the largest share in all contexts. Entertainment is second in three out of four contexts, but drops behind sport and movie to fourth for social weekends. News follows just behind entertainment on weekdays, but it is the genre that decreases most on weekends. Movie and children’s are preferred in social settings, while user-generated is mainly consumed when alone, possibly because the genre has its main roots on smaller screens, where users mainly consume it in solitary settings. The proportion for music is similar among the four contexts.
IV Prediction of Preferences
In the previous section we showed how contextual settings influence viewing situations contained in the collected dataset. In this section we present how the contextual features of the dataset can be used for prediction of consumed content. The goal of this study is to predict what genre a user is going to watch (Q4) in the reported context. Many different methods can be applied to this challenge, such as state-of-the-art factorization machines or neural networks. The focus of this investigation is on the contribution of the contextual dimensions and the type of errors, for which reason we do not present the results of a wide palette of algorithms, but rely on a few that are well-known within the machine learning community. We do, however, show how temporal and social context affect the prediction ability, and compare it to e.g. contextless prediction.
IV-A Features and Methods
The task is defined as a multi-class classification problem with the users’ selections for Q4 as target. The selections for the remaining questions are used as contextual features (see Table III). All features are categorical and represented using one-hot encoding. The optional text inputs for other in Q2, Q4, and Q5 are not included in this study.
|(T)||Time of day||5||Timestamp|
|(M)||Number of viewers||5||Q3|
Table III also lists a number of feature configurations. In this work, we define the feature configuration all as service-independent. Thus, all is a collection of all features except the service feature. The reasoning behind this definition is that a prediction or recommendation of genre will have most impact across providers, since some services have a very targeted range of content genres, e.g. YouTube relies heavily on the user-generated genre. The service feature is included in the all+S configuration. Other configurations can be used as well, e.g. time of day and weekday/weekend (TD). A notable configuration is what we refer to as the contextless, which consists of purely user identity information (U).
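To make the notion of a feature configuration concrete, the sketch below one-hot encodes a single answer for a chosen configuration string such as TD. It is illustrative only: the category labels, the dictionary layout, and the `encode` helper are assumptions, and only three of the features are shown (T and M with five categories each, as in Table III, and D for weekday/weekend as used in the TD configuration).

```python
# Hypothetical category lists per feature code; the labels are illustrative.
CATEGORIES = {
    "T": ["morning", "noon", "afternoon", "evening", "night"],  # time of day
    "D": ["weekday", "weekend"],                                 # weekday/weekend
    "M": ["1", "2", "3", "4", "5+"],                             # number of viewers
}


def encode(answer, config):
    """Concatenate the one-hot vectors of the features named in `config`."""
    vector = []
    for code in config:
        value = answer[code]
        if value not in CATEGORIES[code]:
            raise ValueError(f"unknown category {value!r} for feature {code}")
        vector.extend(1 if value == category else 0 for category in CATEGORIES[code])
    return vector


if __name__ == "__main__":
    answer = {"T": "evening", "D": "weekend", "M": "2"}
    print(encode(answer, "TD"))   # temporal configuration only
    print(encode(answer, "TDM"))  # temporal features plus number of viewers
```

Multiple-option questions such as Q2 allow several selections per answer, so a multi-hot variant would be needed for those; that case is not covered by this sketch.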
Six methods are compared: the scikit-learn implementations of logistic regression (LR), gradient boosting decision trees (GBDT), support vector machines (SVM), and multi-layer perceptrons (MLP), and two baseline methods, namely the most popular (toppop) and random (random) predictors. For toppop, genres are ranked by their popularity judged by the number of observations in the training set. The random predictor randomly ranks the genres for each prediction.
The methods are evaluated using nested cross-validation (also referred to as double cross-validation) with five outer folds and three inner folds. That is, in the outer loop the dataset is split in five folds, using one fold in turn as test set and the remaining four folds as training set. The training set for each outer iteration is further divided into three inner folds for optimization of hyperparameters. To this end, on a rotational basis, two (inner) folds are used for training and one is used as validation set. The best scoring hyperparameter configuration of the inner loop is used to assess the predictive performance of the model on the outer test fold by training on the full training set. We report the average performance across the outer folds and the standard deviation. Users that have not answered at least five times are not included in the evaluation.
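The nested cross-validation procedure described above can be sketched with scikit-learn as follows. This is a sketch under stated assumptions rather than the authors’ code: the data are synthetic stand-ins, the grid of regularization strengths is hypothetical, and the use of stratified, shuffled folds is an assumption, since the text does not say how the folds are drawn. `GridSearchCV` refits the best configuration on the full outer training set by default, which corresponds to the retraining step in the description.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-ins for one-hot contextual features and genre labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 12)).astype(float)
y = rng.integers(0, 3, size=300)

# Three inner folds for hyperparameter selection, five outer folds for testing.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: pick the regularization strength (grid values are hypothetical).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each fold runs the full inner search on its training part only.
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} ({scores.std():.3f})")
```

Passing the search object itself to `cross_val_score` keeps the test fold of each outer iteration out of the hyperparameter selection, which is the point of the nested design.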
IV-B1 Configuration of Hyperparameters
Due to considerations of computational complexity, some hyperparameters are determined empirically prior to run-time and are static throughout feature configurations. We fit the LR weights using stochastic average gradient descent with L2 regularization, and set the multi-class parameter to "multinomial" for softmax regression. For GBDT, we use 1000 boosting stages, each fitted on a random subsample consisting of 50% of the training samples. The SVM uses a one-vs-rest decision scheme, and the MLP is implemented with two hidden layers, each consisting of 200 neurons with rectified linear (ReLU) activation functions, and a softmax output layer, optimized during training using Adam and L2 regularization of weights. During hyperparameter tuning in the inner loop of the nested cross-validation, the following variables are determined: LR - regularization strength; GBDT - maximum depth of individual trees; SVM - kernel type (linear/RBF), kernel coefficient and regularization strength; MLP - regularization strength.
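One way the fixed settings and tuned variables listed above might be expressed with scikit-learn estimators is sketched below. The estimator arguments follow the description in the text; the grid values are hypothetical, since the searched ranges are not given. The multi-class option of LR is left at its default here, on the assumption that recent scikit-learn releases already fit the multinomial (softmax) formulation with the SAG solver.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Static settings taken from the description; max_iter is an added assumption.
models = {
    "LR": LogisticRegression(solver="sag", penalty="l2", max_iter=1000),
    "GBDT": GradientBoostingClassifier(n_estimators=1000, subsample=0.5),
    "SVM": SVC(decision_function_shape="ovr"),
    "MLP": MLPClassifier(hidden_layer_sizes=(200, 200), activation="relu", solver="adam"),
}

# Variables tuned in the inner loop; the candidate values are hypothetical.
param_grids = {
    "LR": {"C": [0.01, 0.1, 1.0, 10.0]},             # inverse regularization strength
    "GBDT": {"max_depth": [2, 3, 5]},                # maximum depth of individual trees
    "SVM": {
        "kernel": ["linear", "rbf"],                 # kernel type
        "gamma": ["scale", 0.01, 0.1],               # kernel coefficient
        "C": [0.1, 1.0, 10.0],                       # inverse regularization strength
    },
    "MLP": {"alpha": [1e-5, 1e-4, 1e-3]},            # L2 penalty on weights
}

if __name__ == "__main__":
    for name, model in models.items():
        print(name, type(model).__name__, param_grids[name])
```

Each model and its grid can be passed to a grid search in the inner loop of the nested cross-validation.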
The accuracy at $K$ predictions (A@K) is used as a metric for evaluation. At $K$ larger than one, multiple guesses are allowed for each trial. It is calculated using:
$$\mathrm{A@}K = \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{1}\left(\hat{y}_{n,k} = y_n\right),$$
where $N$ is the number of trials. $\mathbb{1}(\cdot)$ is the indicator function, which is one if the prediction, $\hat{y}_{n,k}$, is equal to the actual target, $y_n$, and zero otherwise. The vector $\hat{y}_n$ is sorted with predictions in descending order of confidence score, such that $\hat{y}_{n,1}$ is the most probable prediction. The mean reciprocal rank (MRR) is used to assess the average ranking ($k$) of the true targets:
$$\mathrm{MRR} = \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{C}\frac{1}{k}\,\mathbb{1}\left(\hat{y}_{n,k} = y_n\right),$$
where $C$ is the total number of target classes. We also report F1 scores with macro averaging. F1 (macro) is an average of each individual class performance. It gives equal weight to classes, which means it can be used to assess the performance on small classes. This makes F1 (macro) an indicator of a method’s ability to predict diverse target classes. On the other hand, F1 (micro) pools all trials with equal weight. Thus, classes with many samples will dominate classes with few samples. F1 (micro) is with this setup (multi-class single-label) equal to A@1 and therefore not presented explicitly.
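The two ranking metrics can be computed directly from ranked prediction lists. The following is a minimal sketch with hypothetical function names and a toy example; it assumes that each ranked list is ordered from most to least probable and contains every target class, so that the true genre always has a rank.

```python
def accuracy_at_k(ranked_predictions, targets, k):
    """Share of trials whose true target is among the top-k predictions."""
    hits = sum(1 for ranked, target in zip(ranked_predictions, targets) if target in ranked[:k])
    return hits / len(targets)


def mean_reciprocal_rank(ranked_predictions, targets):
    """Average of 1/rank of the true target (rank 1 is the most probable prediction)."""
    total = 0.0
    for ranked, target in zip(ranked_predictions, targets):
        total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)


if __name__ == "__main__":
    # Toy example with three trials and three genres (not taken from the dataset).
    ranked = [
        ["series", "news", "movie"],
        ["news", "series", "movie"],
        ["movie", "news", "series"],
    ]
    targets = ["series", "series", "series"]
    print(accuracy_at_k(ranked, targets, 1))      # true genre ranked first in 1 of 3 trials
    print(accuracy_at_k(ranked, targets, 2))      # within the top two in 2 of 3 trials
    print(mean_reciprocal_rank(ranked, targets))  # (1 + 1/2 + 1/3) / 3
```

The reciprocal of the MRR gives a rough sense of the typical rank of the true genre, which is how values such as 1/MRR are read in the results.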
IV-C1 Feature configuration all
The performance using feature configuration all is shown in Table IV. It is not a surprise that random scores approximately 0.1 in A@1 (10 target classes), and that toppop scores close to 0.25 since it was shown that series as the most watched genre accounts for approximately one fourth of the data points. Also, toppop outperforms random in terms of A@3 and MRR. Note, however, that random performs better than toppop for F1 (macro), due to the diversity in predicted genres.
The remaining methods achieve considerably higher scores than both baselines. LR, as the best scoring, almost doubles the A@1 of toppop, and successfully predicts the genre in approximately 44% of the cases and 82% when allowing three guesses. The MRR indicates that on average the true genre is ranked between first and second (as indicated by $1/\mathrm{MRR}\approx 1.6$) of the 10 possible genres. The corresponding numbers for toppop and random are 2.2 and 3.4, respectively.
|Method||A@1||A@3||F1 (macro)||MRR|
|random||0.101 (0.005)||0.295 (0.004)||0.087 (0.008)||0.290 (0.005)|
|toppop||0.245 (0.009)||0.560 (0.014)||0.039 (0.001)||0.460 (0.008)|
|MLP||0.413 (0.018)||0.787 (0.013)||0.335 (0.022)||0.621 (0.012)|
|GBDT||0.417 (0.014)||0.786 (0.021)||0.354 (0.027)||0.623 (0.009)|
|SVM||0.425 (0.009)||0.809 (0.021)||0.358 (0.015)||0.632 (0.012)|
|LR||0.437 (0.026)||0.815 (0.010)||0.373 (0.031)||0.641 (0.013)|
IV-C2 Genre confusions
A deeper look into the prediction errors of LR for feature configuration all is shown in a confusion matrix in Fig. 7. Series receives many predictions (1022) compared to the actual number of occurrences (740), resulting in the best recall score of all the genres (true positives over number of observations). Despite the many predictions, it also manages to achieve one of the highest precision scores (true positives over number of predictions). The opposite is the case with documentary, which receives few predictions (56) compared to observations (228), leading to a low recall score (0.07). Together with the precision score of 0.27, the resulting F1 score is 0.11, which emphasizes that documentary is difficult to predict in this specific setup. It can be seen from the off-diagonal entries that there are three main genres that are confused with documentary, namely news, series, and entertainment.
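As a check on the reported numbers for documentary, the F1 score is the harmonic mean of precision ($P$) and recall ($R$). The symbols $P$ and $R$ are introduced here for illustration only:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \cdot 0.27 \cdot 0.07}{0.27 + 0.07}
    = \frac{0.0378}{0.34}
    \approx 0.11
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall dominates the result even though the precision is almost four times higher.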
Two genres stand out in terms of F1 score, namely series and children’s, which both achieve a score around 0.6. Next is music with a score of 0.5, which gets a majority of its false positives when the observed genre is user-generated. The user-generated genre has a (relatively) high precision, but a low recall pulls its F1 score down to an average level among the genres. In other words, user-generated content is hard to retrieve for recommendation, but when it is predicted, the predictions are fairly reliable. News, sport, and entertainment also have F1 scores that are close to the F1 (macro). In addition to documentary, the movie genre underperforms. The F1 score of movie is low mainly due to a low recall, which is largely caused by confusion with series, but movie is also frequently confused with news, sport, and entertainment.
IV-C3 Contextual dimensions
A comparison of the methods for multiple feature configurations is shown in Fig. 8. Note that the baseline scores are independent of feature selection. Also, as can be seen from the figure, there is in general no large deviation between the performance of the different methods within each feature configuration. Therefore, we mainly highlight results between feature configurations in the following.
The worst-performing feature configuration in Fig. 8 is the temporal configuration, TD, for which the performance of LR drops almost to the level of the baselines. Still, its A@1, A@3, and MRR are significantly better than those of the random predictor, and its F1 (macro) score is higher than what toppop achieves.
Adding social aspects (TD→TDW) significantly improves performance according to McNemar’s test³ (for LR: $\chi^2(1)=29.10$, $p<0.001$, $V=0.09$). Furthermore, knowledge of additional viewers and attention level, WA, scores on par with TDW, and even slightly improves in terms of F1 (macro).
³ A $2\times 2$ matrix, $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, is formed with $a$ being the number of trials where both methods are correct, $b$ and $c$ the trials where exactly one of the methods fails, and $d$ the trials where both are incorrect. McNemar’s test statistic and Cramér’s $V$ are then computed as $\chi^2 = \frac{(b-c)^2}{b+c}$ and $V = \sqrt{\chi^2 / N}$, where $N = a+b+c+d$.
Notably, contextless prediction, U, outperforms all-U, which indicates the importance of user-specific behavior and habits (for LR: $\chi^2(1)=26.33$, $p<0.001$, $V=0.09$). The F1 (macro), however, does not see as large an improvement between the two configurations as the three other metrics. Adding temporal context to the user ID (U→UTD) increases performance slightly, while F1 (macro) improves considerably, suggesting that the temporal information enhances the ability to predict diverse genres. As is the case with TD→WA, UWA performs significantly better than UTD (for LR: $\chi^2(1)=9.83$, $p<0.002$, $V=0.06$). In fact, UWA provides results on the same level as all, with no significant difference according to McNemar’s test (for LR: $\chi^2(1)=3.52$, $p>0.05$, $V=0.03$). Lastly, knowing which service the user will be using, all+S, achieves the highest scores.
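The significance tests above can be sketched as follows. This is a minimal illustration that assumes the uncorrected McNemar statistic (no continuity correction) and the usual effect size $\sqrt{\chi^2/N}$ for a 2×2 table; the function names and the discordant counts are hypothetical, while the $\chi^2$ values checked at the end are the ones reported for LR.

```python
import math
from typing import Tuple


def chi2_sf_1dof(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 dof, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


def mcnemar(b: int, c: int) -> Tuple[float, float]:
    """McNemar statistic and p-value from the two discordant counts.

    b: trials where only the first method is correct,
    c: trials where only the second method is correct.
    Assumes the uncorrected form (b - c)^2 / (b + c).
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, chi2_sf_1dof(chi2)


def cramers_v_2x2(chi2: float, n: int) -> float:
    """Effect size sqrt(chi2 / n) for a 2x2 table with n trials."""
    return math.sqrt(chi2 / n)


# Hypothetical counts: 3000 trials, 300 vs. 180 discordant outcomes.
toy_chi2, toy_p = mcnemar(300, 180)
toy_v = cramers_v_2x2(toy_chi2, 3000)

# p-values implied by the chi-square statistics reported for LR.
reported = {chi2: chi2_sf_1dof(chi2) for chi2 in (29.10, 26.33, 9.83, 3.52)}
for chi2, p in reported.items():
    print(f"chi2(1) = {chi2:.2f} -> p = {p:.4f}")
```

Only the discordant trials ($b$ and $c$) enter the statistic, so two methods with similar overall accuracy can still differ significantly if they fail on different trials.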
The predictive performance using LR for feature configuration all (see Fig. 7) is best for the two genres series and children’s in terms of F1 score. According to the data analysis (presented in Section III-B), children’s content exhibits distinctive contextual preferences, such as low attention level and multiple viewers, resulting in fewer prediction confusions. The opposite is the case with documentary and movie, which are not easily distinguishable from the features collected in this study, making them the genres with the lowest F1 scores. One reason for their worse performance compared to series could be the imbalance of the dataset, and the scores of the less viewed genres could possibly improve if this imbalance were handled. Also, one should keep genre ambiguities in mind when evaluating the system. For example, a movie could be a children’s movie, thereby potentially covering two genres in this study. Assuming that a participant labels consistently throughout the study, these ambiguities would primarily result in reduced performance when making use of information across users.
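One common way to handle such imbalance, not evaluated in this study, is to weight each genre inversely to its frequency during training. The sketch below computes weights as $n_\text{samples} / (n_\text{classes} \cdot n_c)$, which is the heuristic scikit-learn applies for `class_weight="balanced"`. The function name is hypothetical, and restricting the example to two genres is a simplification; only the series (740) and documentary (228) counts come from the text.

```python
from typing import Dict


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {genre: n_samples / (n_classes * count) for genre, count in counts.items()}


# Two-genre toy example using the observation counts quoted for Fig. 7.
genre_counts = {"series": 740, "documentary": 228}
genre_weights = balanced_class_weights(genre_counts)
print(genre_weights)
```

With these weights every genre contributes the same total weight to the training loss, so errors on rarely watched genres such as documentary are penalized more heavily than errors on series.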
The analysis of the collected data suggests that contextual aspects are an integral part of users’ decision process when selecting what content to watch on TV. This hypothesis is supported by the findings of Section IV, which show that the inclusion of contextual information is positively associated with improved predictive performance, both in terms of accuracy and diversity. Note that this is achieved without introducing prior expert knowledge of certain situations, such as the often easily distinguishable children’s setting. That being noted, the results also highlight the importance of knowing who is watching, since all-U→U shows significant improvement. Interestingly, though adding temporal information to the user ID (U→UTD) only shows a slight improvement in accuracy, it clearly leads to more diverse predictions. Hence, contextless prediction (U) achieves high accuracy scores, but seemingly at the expense of the ability to predict diverse genres. However, the results obtained in this work show that contextual prediction benefits from knowing user habits, so drawing on contextual patterns of viewing situations should not exclude personalization based on the past behavior of each individual.
As opposed to the findings of  that the effects of social context are canceled when also considering temporal context, our results suggest a significant improvement when adding social context to temporal context, i.e., TD→TDW. This is also evident in terms of diversity in the predicted genres. What users watch, when, and with whom may be correlated, as shown by the habitual behavior of users when watching TV, but our results indicate that knowing the social context of a viewing situation enables the system to adapt to some scenarios that deviate from temporal habits. As an example, content chosen for Friday nights could be very different depending on the social company of the user, while Monday mornings frequently consist of the same users and content.
VI Concluding Remarks
In this paper, we introduced the novel and publicly available CTV dataset of TV consumption enriched with contextual information that contributes to the evaluation of TV viewing in the home. To this end, we conducted an extensive field study over a period of five weeks with a group of more than 100 participants. Using the dataset, we showed associations between different aspects of TV watching, e.g. users’ average attention level of genres, and how these change over time and in different social situations. Furthermore, we evaluated to which degree contextual knowledge influences the performance of predicting what content will be consumed. The experimental results showed that inclusion of contextual information significantly improves accuracy and diversity compared to contextless predictions, but also that knowledge of past behavior is essential to achieve high accuracies.
In future work we plan to apply methods that have proven successful within the context-aware recommender systems community, and evaluate how they adapt to a dataset with a limited number of target classes and multiple interactions between the same user and target. This would include deep models suitable for inferring latent contextual features.
This work is supported by Bang and Olufsen A/S and the Innovation Fund Denmark (IFD) under File No. 5189-00009B.
-  D. Véras, T. Prota, A. Bispo, R. Prudêncio, and C. Ferraz, “A literature review of recommender systems in the television domain,” Expert Systems with Applications, vol. 42, no. 22, pp. 9046–9076, Dec. 2015.
-  C. A. Gomez-Uribe and N. Hunt, “The Netflix Recommender System: Algorithms, Business Value, and Innovation,” ACM Trans. Manage. Inf. Syst., vol. 6, no. 4, pp. 13:1–13:19, Dec. 2015.
-  G. Adomavicius and A. Tuzhilin, “Context-Aware Recommender Systems,” in Recommender Systems Handbook. Springer, Boston, MA, 2015, pp. 191–226.
-  J. Masthoff, “Group Recommender Systems: Aggregation, Satisfaction and Group Attributes,” in Recommender Systems Handbook. Springer, Boston, MA, 2015, pp. 743–776.
-  Y. Shi, M. Larson, and A. Hanjalic, “Collaborative Filtering Beyond the User-Item Matrix: A Survey of the State of the Art and Future Challenges,” ACM Comput. Surv., vol. 47, no. 1, pp. 3:1–3:45, May 2014.
-  R. Pagano, P. Cremonesi, M. Larson, B. Hidasi, D. Tikk, A. Karatzoglou, and M. Quadrana, “The Contextual Turn: From Context-Aware to Context-Driven Recommender Systems,” in Proceedings of the 10th ACM Conference on Recommender Systems, ser. RecSys ’16. ACM, 2016, pp. 249–252.
-  R. Turrin, A. Condorelli, P. Cremonesi, and R. Pagano, “Time-based TV programs prediction,” in 1st Workshop on Recommender Systems for Television and Online Video at ACM RecSys, 2014, p. 7.
-  B. Jardine, J. Romaniuk, J. G. Dawes, and V. Beal, “Retaining the primetime television audience,” European Journal of Marketing, vol. 50, no. 7/8, pp. 1290–1307, Jul. 2016.
-  R. Larson and M. Csikszentmihalyi, “The Experience Sampling Method,” New Directions for Methodology of Social & Behavioral Science, vol. 15, pp. 41–56, 1983.
-  M. S. Kristoffersen, S. E. Shepstone, and Z.-H. Tan, “A Dataset for Inferring Contextual Preferences of Users Watching TV,” in Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, ser. UMAP ’18. ACM, 2018, pp. 367–368.
-  D. Saxbe, A. Graesch, and M. Alvik, “Television as a Social or Solo Activity: Understanding Families’ Everyday Television Viewing Patterns,” Communication Research Reports, vol. 28, no. 2, pp. 180–189, Apr. 2011.
-  J. Abreu, P. Almeida, B. Teles, and M. Reis, “Viewer Behaviors and Practices in the (New) Television Environment,” in Proceedings of the 11th European Conference on Interactive TV and Video, ser. EuroITV ’13. ACM, 2013, pp. 5–12.
-  K. Mercer, A. May, and V. Mitchel, “Designing for video: Investigating the contextual cues within viewing situations,” Personal and Ubiquitous Computing, vol. 18, no. 3, pp. 723–735, Mar. 2014.
-  J. Vanattenhoven and D. Geerts, “Contextual aspects of typical viewing situations: A new perspective for recommending television and video content,” Personal and Ubiquitous Computing, vol. 19, no. 5-6, pp. 761–779, Aug. 2015.
-  M. Csikszentmihalyi and R. Larson, “Validity and Reliability of the Experience-Sampling Method,” in Flow and the Foundations of Positive Psychology. Springer, Dordrecht, 2014, pp. 35–54.
-  M. Kim, J. Kim, S. Han, and J. Lee, “A Data-driven Approach to Explore Television Viewing in the Household Environment,” in Proceedings of the 2018 ACM International Conference on Interactive Experiences for TV and Online Video, ser. TVX ’18. ACM, 2018, pp. 89–100.
-  E. Frolov and I. Oseledets, “Tensor methods and recommender systems,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 7, no. 3, p. e1201, May 2017.
-  S. Rendle, “Factorization Machines,” in 2010 IEEE International Conference on Data Mining, Dec. 2010, pp. 995–1000.
-  S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme, “Fast Context-aware Recommendations with Factorization Machines,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’11. ACM, 2011, pp. 635–644.
-  H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & Deep Learning for Recommender Systems,” in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, ser. DLRS 2016. ACM, 2016, pp. 7–10.
-  M. Unger, B. Shapira, L. Rokach, and A. Bar, “Inferring Contextual Preferences Using Deep Auto-Encoding,” in Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, ser. UMAP ’17. ACM, 2017, pp. 221–229.
-  L. Ardissono, C. Gena, P. Torasso, F. Bellifemine, A. Difino, and B. Negro, “User Modeling and Recommendation Techniques for Personalized Electronic Program Guides,” in Personalized Digital Television, ser. Human-Computer Interaction Series. Springer, Dordrecht, 2004, pp. 3–26.
-  M. Aharon, E. Hillel, A. Kagian, R. Lempel, H. Makabee, and R. Nissim, “Watch-It-Next: A Contextual TV Recommendation System,” in Machine Learning and Knowledge Discovery in Databases, ser. Lecture Notes in Computer Science. Springer, Cham, Sep. 2015, pp. 180–195.
-  M. Bambia, M. Boughanem, and R. Faiz, “Exploring Current Viewing Context for TV Contents Recommendation,” in 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Oct. 2016, pp. 272–279.
-  Y. Park, J. Oh, and H. Yu, “RecTime: Real-Time recommender system for online broadcasting,” Information Sciences, vol. 409-410, pp. 1–16, Oct. 2017.
-  S. E. Shepstone, Z.-H. Tan, and S. H. Jensen, “Using Audio-Derived Affective Offset to Enhance TV Recommendation,” IEEE Transactions on Multimedia, vol. 16, no. 7, pp. 1999–2010, Nov. 2014.
-  S. H. Hsu, M.-H. Wen, H.-C. Lin, C.-C. Lee, and C.-H. Lee, “AIMED- A Personalized TV Recommendation System,” in Interactive TV: A Shared Experience, ser. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, May 2007, pp. 166–174.
-  E. Vildjiounaite, V. Kyllönen, T. Hannula, and P. Alahuhta, “Unobtrusive dynamic modelling of TV programme preferences in a Finnish household,” Multimedia Systems, vol. 15, no. 3, pp. 143–157, Jul. 2009.
-  S. Song, H. Moustafa, and H. Afifi, “Advanced IPTV Services Personalization Through Context-Aware Content Recommendation,” IEEE Transactions on Multimedia, vol. 14, no. 6, pp. 1528–1537, Dec. 2012.
-  P. Cremonesi, P. Modica, R. Pagano, E. Rabosio, and L. Tanca, “Personalized and Context-Aware TV Program Recommendations Based on Implicit Feedback,” in E-Commerce and Web Technologies, ser. Lecture Notes in Business Information Processing. Springer, Cham, Sep. 2015, pp. 57–68.
-  F. Lorenz, J. Yuan, A. Lommatzsch, M. Mu, N. Race, F. Hopfgartner, and S. Albayrak, “Countering Contextual Bias in TV Watching Behavior: Introducing Social Trend As External Contextual Factor in TV Recommenders,” in Proceedings of the 2017 ACM International Conference on Interactive Experiences for TV and Online Video, ser. TVX ’17. ACM, 2017, pp. 21–30.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.