
VoCoG: An Intelligent, Non-Intrusive Assistant for Voice-based Collaborative Group-Viewing

by   Sumit Shekhar, et al.

There have been significant innovations in media technologies in recent years. While these developments have improved experiences for individual users, the design of multi-user interfaces remains a challenge. A relatively unexplored area in this context is enabling multiple users to enjoy shared viewing (e.g., deciding on movies to watch together). In particular, the challenge is to design an intelligent system that enables viewers to seamlessly explore, together, shows or movies they like. This is a complex design problem, as it requires the system to (i) assess the affinities of individual users (movies or genres), (ii) combine individual preferences while taking user-user interactions into account, and (iii) remain non-intrusive throughout. The proposed system, VoCoG, is an end-to-end intelligent system for collaborative viewing. VoCoG incorporates an online recommendation algorithm, efficient methods for analyzing natural conversation, and a graph-based method to fuse the preferences of multiple users. It takes user conversation as input, which makes it non-intrusive. A usability survey indicates that the system provides a good experience to users as well as relevant recommendations. Further analysis of the usage data reveals insights about the nature of conversation during the interaction sessions, the final consensus among users, and the ratings of varied user groups.



I Introduction

The realm of human-computer interaction has vastly expanded, with technologies for immersive experience making great strides [1]. Moreover, there has been a huge shift in media consumption, with a large population moving online for personalized consumption of media content such as video or music. Hence, there is a growing need for innovation in the design of human-computer interaction techniques to provide a seamless, immersive experience for media consumption [2, 3].

A challenging design problem in this context is social/collaborative viewing, which aims to allow remotely located users to enjoy shared viewing of media content in a way that makes them feel seated together, as in conventional group viewing. The impact of group viewing on improving the viewing experience has been well studied in television research [4, 5]. The works in [6] and [3] formalized the concept of remote social viewing. [6] designed a SocialTV experiment to investigate how groups behave when watching a program together. [3] built CollaboraTV, which incorporated user collaboration while watching television through messaging and shared interest profiles. In a large-scale study of the online sports viewing experience, Mo et al. [7] demonstrated that sharing thoughts and information, and the desire to belong to a group, improve the watching experience. Further, McGill et al. [8] built a synchronous, shared, at-a-distance smart TV system, and analyzed the adoption of the system and the nature of communication. They also built a VR prototype for shared viewing and showed its effectiveness in enhancing the viewing experience. Commercially, some services also support synchronized viewing of broadcast content. Beyond these, most online video platforms support some form of social interaction. For example, Facebook Live allows users to ”like” a live video, whereas Hulu enables users to edit and share video clips with others. The social functionality also helps users in content discovery on the platforms.

However, the design of an interface that could meaningfully enable remote viewers to explore and decide on video content they would like to watch together has not been looked into extensively. While previous work enables remote users in a collaborative viewing session to communicate through chat, voice or video, there has been little focus on developing interfaces that would enhance the content discovery experience in such scenarios. To this end, we present VoCoG, an intelligent system for voice-based collaborative group-viewing. The proposed system attempts to address various challenges in achieving a seamless content discovery experience in collaborative viewing settings.

Firstly, VoCoG incorporates voice as a medium of interaction between users. This is non-intrusive, as the users are not required to type or click, and is particularly suited for immersive interfaces [9, 10]. Further, natural user conversations allow VoCoG to extract rich user feedback (such as movie and star affinity, expressed sentiments, etc.) using advanced natural language processing techniques. Moreover, even though popular personal assistants like Siri or Alexa are built on voice-based interfaces, we believe that there has been limited work on voice-driven, feedback-based recommendations in multi-user interaction systems.

VoCoG deploys an online recommendation algorithm, which can efficiently update user preferences based upon the complex feedback from conversations. Conversation-based [11], critique-based [12] and online [13, 14, 15] recommendation methods have started gaining attention recently. We exploit insights from this recent work to build an online recommendation system, which computes the recommended movies for each individual based upon the feedback from his or her conversations.

Finally, the challenge is how to combine the recommendations for each individual into the final watch list for the group as a whole. VoCoG uses concepts from group behavior modeling in social networks [16], as well as from group-based recommendations [17, 18], and takes into account user-user agreements and disagreements, individual affinities towards movies, shows or stars, as well as user behavioral traits, to arrive at meaningful recommendations for the group to watch. VoCoG can also detect whether the group has reached a consensus on watching a video.

The paper is divided into six sections. Section II describes the related work in the area. The details of the design of the proposed interface, VoCoG are in Section III, while Section IV describes the final prototype. A comprehensive user evaluation of the system is discussed in Section V followed by the conclusions in Section VI.

Fig. 1: Workflow diagram for VoCoG, an intelligent interface for collaborative group-viewing.

II Related Work

In this section, we describe the related work in the design of multi-user interfaces, recommendations, conversation analysis and multi-user interaction modeling.
Multi-user Interfaces: There has been extensive research in the design of multi-user interfaces [19, 20]. Virtual presence [10] as well as the design of virtual worlds using avatars [1] have been studied. The use of voice for human-machine interaction [21, 22] as well as immersive media [23] has been studied. There has also been related work on the design of interactive shared viewing experiences [6, 3, 24]. However, these approaches provide restrictive interaction mechanisms between users, through chats or avatars. The proposed approach, in contrast, is designed to account for rich conversation between users, and to provide a seamless, non-intrusive experience.
Recommendations: A comprehensive survey of recommendation algorithms has been done by [25]. There has been work on group-based recommendations [26, 17, 18]. Conversation-based [11, 27] as well as critique-based recommendations [12] have been studied. Online recommendation techniques based on Bayesian methods [28], bandits [14] and latent analysis [13] have been proposed recently. Our approach is motivated by [13] to use user-user clustering for updating individual preferences. This allows VoCoG to account for complex updates from user conversation, while outputting relevant recommendations.
Conversation Analysis: There has been extensive work in analyzing natural user conversation. The existing work describes methods for language parsing [29, 30], text tagging [31] and entity recognition [32]. There has also been considerable work in sentiment analysis [33, 34] and intention mining [35, 36] from text. Further, Mikolov et al. [37] have looked into robust semantic representation of words. Commercial applications also provide services for entity and intent extraction. VoCoG requires comprehensive parsing of conversation data, including entity extraction, sentiment analysis as well as parsing direct and indirect references in the conversation sequence. Prior work does not address this sufficiently. Hence, we build upon the existing work to analyze user conversation, and extract the required information.
User-User Interaction Modeling: There has been work in group behavior modeling in social networks [16]. However, the area of small group conversation is relatively unexplored. Prior work has looked at the problems of conflict resolution [38], identifying the speaker [39] and addressee [40], and modeling face-to-face conversations [41]. We address the challenge of conflict modeling in multi-user conversations through a novel user-user graph.

III System Implementation

In this section, we describe the modeling for VoCoG, the proposed intelligent assistant interface. The workflow for the approach is shown in Figure 1. The essential components of the system are an online recommendation algorithm, a module to understand the voice conversation between users, and inter-user interaction modeling. Each of these modules is described in detail below.

III-A Recommendation System

VoCoG combines a novel incremental collaborative filtering technique with content filtering-based techniques to arrive at a robust ranking of show preferences for individual users. Thereafter, the algorithm to update ratings based upon the user conversations is discussed. How the group recommendations are arrived at is described later, in Section III-C.

III-A1 Movie Database

We used the MovieLens [42] dataset for training the recommendation system. The dataset has about million ratings for K movies by around million users. We pruned out users with rating less than movies and movies having less than ratings, leaving around K users, K movies and M ratings. This was done to reduce the movie search space when updating VoCoG recommendations. Moreover, relatively unknown movies, not rated by enough users, would not generate conversation among users. The dataset was further enriched, by crawling the web, with the genre terms, actors and directors for each of the final K movies. This enriched data was used for training the VoCoG recommendation models.

III-A2 Collaborative Filtering

We chose to deploy a simple probabilistic latent analysis (probLat)-based method for collaborative filtering, and describe a method inspired by [13] to efficiently incorporate complex feedback from user conversations (Section III-A4). For a given user, the rating of a movie was modeled as a function of a latent variable and the movie. The latent variable was introduced to decouple the probabilistic dependency between users and movie ratings. Different user interest groups were captured by the latent variable, and hence the rating of a movie for a user was computed by marginalizing over the latent variable.

Each user belonged to each cluster with some probability, and each cluster had its own distribution of ratings for a movie, characterized by a mean and a variance.

Expectation Maximization Algorithm: For training the probLat model on the MovieLens dataset, an EM algorithm [43] was used. The E-step calculated the posterior probability of the latent variable, given the user, movie and rating.

Once the posterior probabilities were computed, the M-step computed the probability of each user belonging to the different clusters and the parameters of the rating distributions.

Log-likelihood was used to measure convergence of the algorithm. The algorithm was terminated when the change in the log-likelihood fell below a fixed fraction of the log-likelihood at that step.
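For reference, one standard Gaussian latent-class formulation that is consistent with the description above is sketched here. The symbols $u$ (user), $m$ (movie), $r$ (rating), $z$ (latent cluster), $\mu_{m,z}$, $\sigma_{m,z}$, and the index sets $R_u$ (ratings by user $u$) and $R_m$ (ratings of movie $m$) are notation assumed for this sketch; the exact parameterization used in VoCoG may differ.

```latex
% Model: mixture over latent clusters with Gaussian cluster-level ratings (assumed form)
P(r \mid u, m) = \sum_{z} P(z \mid u)\, P(r \mid m, z),
\qquad
P(r \mid m, z) = \mathcal{N}\!\left(r;\ \mu_{m,z},\ \sigma_{m,z}^{2}\right)

% Predicted rating: expectation under the mixture
\hat{r}_{u,m} = \sum_{z} P(z \mid u)\, \mu_{m,z}

% E-step: posterior of the latent cluster for an observed triple (u, m, r)
P(z \mid u, m, r) =
  \frac{P(z \mid u)\, P(r \mid m, z)}{\sum_{z'} P(z' \mid u)\, P(r \mid m, z')}

% M-step: cluster membership of each user
P(z \mid u) = \frac{1}{|R_u|} \sum_{(m, r) \in R_u} P(z \mid u, m, r)

% M-step: cluster-specific mean and variance of each movie
\mu_{m,z} =
  \frac{\sum_{(u, r) \in R_m} r\, P(z \mid u, m, r)}{\sum_{(u, r) \in R_m} P(z \mid u, m, r)},
\qquad
\sigma_{m,z}^{2} =
  \frac{\sum_{(u, r) \in R_m} (r - \mu_{m,z})^{2}\, P(z \mid u, m, r)}{\sum_{(u, r) \in R_m} P(z \mid u, m, r)}
```

The membership update divides by $|R_u|$ because the posteriors for each observed rating sum to one over $z$.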

III-A3 Content Filtering

Content filtering was done through the nearest neighbor approach. Based on the movie ratings given by a user, a score for each genre was calculated from the user's ratings of the movies containing that genre, and a score for each star was calculated from the user's ratings of the movies in which that star has acted.

Both sets of scores were normalized with respect to the list of genres and stars respectively. The content-based score of a movie for the user was then calculated by averaging the scores for the genres in the movie and the scores for the stars present in the movie.
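A minimal sketch of this content-based scoring follows. It assumes that the per-genre and per-star scores are plain averages of the user's ratings, normalized to sum to one, and that a movie's score pools its genre and star scores into a single average; the function names and data layout are hypothetical, not taken from the VoCoG implementation.

```python
from collections import defaultdict


def attribute_scores(user_ratings, movie_attrs):
    """Score each attribute (genre or star) for one user.

    user_ratings: {movie: rating} for a single user.
    movie_attrs:  {movie: iterable of attributes}.
    Assumption: the score is the mean rating over movies carrying the
    attribute, normalized so that all scores sum to 1.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie, rating in user_ratings.items():
        for attr in movie_attrs.get(movie, ()):
            totals[attr] += rating
            counts[attr] += 1
    means = {attr: totals[attr] / counts[attr] for attr in totals}
    norm = sum(means.values())
    return {attr: s / norm for attr, s in means.items()} if norm else means


def content_score(movie, genre_scores, star_scores, movie_genres, movie_stars):
    """Average the user's scores for the genres and stars of one movie."""
    parts = [genre_scores.get(g, 0.0) for g in movie_genres.get(movie, ())]
    parts += [star_scores.get(s, 0.0) for s in movie_stars.get(movie, ())]
    return sum(parts) / len(parts) if parts else 0.0
```

Attributes the user has never rated contribute a score of zero here, which is one of several reasonable defaults.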

Fig. 2: Variation in the ranked lists of movies across four clusters.

III-A4 Incorporating user preferences

Here we describe the model for updating recommendations based upon conversation-based feedback. The model can account for feedback on movies, stars or genres from the user.
1. Ranked Movie List: Once the probLat model was trained, the list of movies was ranked in descending order of rating for each user interest group. The mean rating and variance of a movie varied with the cluster. Hence, as shown in Figure 2, each cluster has different movies at the top. The top movies from each cluster were used for the next step.
2. Calculating Genre Scores: For each cluster, scores for different genre terms were calculated by averaging the predicted rating of the movies in the ranked list that contain the genre term. The list of genres was created from the tagged MovieLens data. The genre terms were then ranked in descending order of score for each cluster to get a cluster-specific ranked list. Figure 3 shows the variation of genre scores across clusters. The cluster-specific genre scoring was used for updating user preferences.
3. Incorporating feedback using the interest group probability: We exploited the difference in movie or genre preference across clusters to incorporate conversation feedback, by modifying the interest group probability of the user. Different distributions of this probability led to different movies being generated as recommendations by the model. We extracted keywords such as genres, movie names and stars from the user conversations, along with the attached sentiment, as described in Section III-B. Here, we describe how to update the interest group probability using the genre preference of the user, but the method can be extended to movies or stars as well.

Consider a (genre term, sentiment value) pair extracted from a particular user conversation. For example, for a conversation like ”Right now, I am in mood for action movies”, the pair would be (action, positive sentiment). The probability of each cluster is then updated by an exponential factor that depends on the extracted sentiment and on the rank of the genre term in that cluster's ranked list, scaled by a hyperparameter.

The rank-based factor was bounded, and the exponential ensured that updates can be done serially. The update worked as follows: if a genre, like action, ranked higher in some clusters than in others and the user expressed a positive sentiment about it, then the probability of the user being in those clusters was increased. The higher the rank of action in a cluster, the greater the update factor for that cluster. Similar equations were used to update the probability using movie and star keywords. This was repeated for each extracted keyword-sentiment pair. The probability is normalized after all the terms have been processed.
4. Updating content filtering: We similarly updated the content-based preferences using the input from conversation, for example for genre preference. Other terms can be taken care of in the same way.
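The cluster-probability update in step 3 can be sketched as below. The exact update equation is not reproduced here, so this is an illustration under stated assumptions: the rank-based factor is taken to be linear in the genre's rank and to lie in [0, 1], the sentiment is a signed number, and `alpha` stands in for the hyperparameter. All names are hypothetical.

```python
import math


def update_cluster_probs(p_z, genre_rank, genre, sentiment, alpha=1.0):
    """Apply one (genre, sentiment) feedback pair to a user's cluster weights.

    p_z:        list of unnormalized weights, one per cluster.
    genre_rank: genre_rank[z] is cluster z's ranked genre list (best first).
    sentiment:  signed sentiment value, e.g. +1.0 or -1.0 (assumed encoding).
    Assumption: each weight is multiplied by exp(alpha * sentiment * factor),
    where factor is 1 for the top-ranked genre and 0 for the last or absent one.
    """
    updated = []
    for z, p in enumerate(p_z):
        ranked = genre_rank[z]
        if genre not in ranked:
            factor = 0.0
        elif len(ranked) == 1:
            factor = 1.0
        else:
            factor = 1.0 - ranked.index(genre) / (len(ranked) - 1)
        updated.append(p * math.exp(alpha * sentiment * factor))
    return updated


def normalize(p_z):
    """Rescale weights into a probability distribution."""
    total = sum(p_z)
    return [p / total for p in p_z]
```

Because each pair contributes a multiplicative exponential factor, pairs can be applied one after another in any order, with a single normalization at the end.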

Fig. 3: Variation of Scores of six genres viz. musical, horror, documentary, sci-fi and fantasy across four clusters.

III-A5 Implementation and Results

For implementation, the number of clusters in the probLat model was fixed and the hyperparameter was set empirically. The final recommendations were arrived at by a simple average of the probLat and content filtering scores. The probLat model was also compared with different methods in the literature (Table I). For testing, the rating of one random movie among the movies rated by each user was removed. The model was then trained on the reduced dataset and tested on the ratings removed from the dataset. It can be seen that the probLat method shows comparable performance with some of the previous methods.

Method Mean Absolute Error
probLat (clusters = )
SVD [44]
k-NN [44]
TABLE I: Comparison of recommendation accuracy for the probLat model.
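The evaluation protocol behind Table I (hold out one random rating per user, train on the rest, report mean absolute error on the held-out ratings) can be sketched as follows. The data layout and the `predict` callback are assumptions made for illustration.

```python
import random


def hold_out_one_per_user(ratings, seed=0):
    """Split {user: {movie: rating}} into train data and one test rating per user.

    Users with fewer than two ratings are kept entirely in the training data.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for user, rated in ratings.items():
        if len(rated) < 2:
            train[user] = dict(rated)
            continue
        held = rng.choice(sorted(rated))  # sorted for reproducibility
        train[user] = {m: r for m, r in rated.items() if m != held}
        test[user] = (held, rated[held])
    return train, test


def mean_absolute_error(test, predict):
    """Average |predicted - actual| over the held-out (movie, rating) pairs.

    predict: callable (user, movie) -> predicted rating, trained on `train`.
    """
    errors = [abs(predict(u, m) - r) for u, (m, r) in test.items()]
    return sum(errors) / len(errors)
```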

III-B Natural Language Understanding

It is challenging to process human language, more so when people are conversing. To incorporate non-intrusive feedback in VoCoG, we needed to design a workflow that could update the viewer preferences based solely on their conversation. For this, we broke down the complex conversation into simpler keyword-sentiment pairs, which could then be used to update user preferences as discussed in Section III-A. The keywords included movie mentions, named entities like stars or directors, and mentions of genre terms. The process is described in detail below.

III-B1 Speech-to-text Conversion

The user conversations were first converted to text using existing APIs [45]. Though the accuracy of speech-to-text APIs has increased considerably, there are accompanying challenges in further processing, as described below. The conversations were analyzed sentence-wise.

III-B2 Conversation Database

A database of user conversations (MovieForum) was curated from a movie discussion forum. The dataset created had different threads and an average of comments in each thread and each thread involved users on an average. The conversations were manually labeled with movie, genre and actor mentions. We further tagged the conversations with mentions of the users who are involved in the discussion. These tags can be either direct mentions of a user or star, or indirect ones through the use of pronouns, etc. Each of the conversations was further labeled manually with a sentiment value (, or ).

For the purpose of evaluating different tagging and sentiment detection approaches, we used a train/test split of the corresponding dataset. The hyper-parameters were tuned using a k-fold cross-validation scheme. In the cases where no training was required, the full dataset was used for evaluation.

III-B3 Sentence-Level Keyword Extraction

We describe below the proposed methods for extraction of different types of keywords, like genre, movie and actor.
1. Genre Terms: Most of the genre terms in the curated MovieLens database from Section III-A (like action, drama) were single words. So, the genre terms were extracted using simple word search. The look-up list of genres was compiled from the movie database. The method gave an F-score of around on the MovieForum database.
2. Movie Terms: Movie names were more complex, like ”One Flew Over the Cuckoo’s Nest”. Also, there was a comprehensive movie list to search through (around 10k for our MovieLens dataset). Hence, a two-step process was used for extraction:
a. Movie Tagging: Alternate methods of tagging potential movie mentions were compared for this purpose.
Baseline approach: The existing, state-of-the-art POS tagging method [29] was used to detect nouns in the sentences, and the detected parts were then used as the tags for movie mentions.
Learning-based approach: The training data from the MovieForum data was IOB-tagged [31]. The features used for training a gradient-boosted classifier included the POS tag of the current word as well as those of the words in a fixed-length window around it, the position of the word (e.g., first or last word), a vector representation of the word provided by the word2vec model [37], and whether the word is among the most frequent words in the movie name list (MovieLens data). The output of this classifier was smoothed using an HMM-based sequence analyzer, trained on the MovieForum data with I, O, B as hidden states. This was done to weed out some of the unlikely classifications made by the classifier. The method overcame the challenge of unreliable capitalization and could detect long names as well. The performance of the tagging approaches is summarized in Table II.

Precision Recall
Baseline [29]
Proposed Approach
TABLE II: Results for movie tagging detection.

b. Movie Name Search: The tagged output was then matched with the movie names in the MovieLens database using a string search based on the Levenshtein distance measure. The top-ranked movies were then re-ranked on the basis of the context of the conversation. Context included the genres or actors detected in the previous conversation. The scores for the movies related to these mentions were increased, and the movies were then ranked accordingly. Table III shows the overall performance of the movie extractor.

Precision Recall
Baseline [29]
Proposed Approach
TABLE III: Results for movie name recognition.

3. Movie Star Terms: Movie stars were detected following the method used for movies. The star tagging method was compared to the Stanford named-entity tagger [32]. It can be seen in Table IV that the proposed approach outperforms the Stanford tagger on recall, with only a small decrease in precision.

Precision Recall
Stanford Name Tagger [32]
Proposed Approach
TABLE IV: Results for star name recognition.

4. Indirect references: References to a movie or a star using pronouns like it or him/her were attached to the last mention of a movie or a star detected in the conversation.
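The simpler parts of the extraction steps above (genre look-up, Levenshtein-based title search with a context boost, and attaching pronouns to the last mention) can be sketched as follows. This is an illustrative simplification: the learned tagger and HMM smoothing are not reproduced, and the function names, the `boost` parameter and the pronoun list are assumptions.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_genres(sentence, genres):
    """Genre extraction by simple word search against a look-up list."""
    words = set(re.findall(r"[a-z\-]+", sentence.lower()))
    return [g for g in genres if g in words]


def match_title(candidate, titles, boosted=(), boost=2):
    """Return the title closest to a tagged candidate by edit distance.

    Titles in `boosted` (e.g. related to a genre or actor mentioned earlier)
    have their distance reduced by `boost`, a hypothetical context bonus.
    """
    def score(title):
        d = levenshtein(candidate.lower(), title.lower())
        return d - boost if title in boosted else d
    return min(titles, key=score)


PRONOUNS = {"it", "him", "her"}  # assumed list of indirect references


def resolve_reference(sentence, last_mention):
    """Attach a pronoun in the sentence to the last detected movie or star."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return last_mention if any(w in PRONOUNS for w in words) else None
```

For a catalogue of around 10k titles a linear scan of this kind is workable, though an index over candidate titles would reduce the search cost.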

III-B4 Sentence-Level Sentiment Analysis

The existing sentiment analysis methods were found insufficient for our case. They did not classify sentiment for intent well, e.g., ”We should be watching Inception” was classified as a neutral sentiment. They also did not handle sentences framed as questions, e.g., ”Why shouldn’t we watch Inception?”, to which the baseline approaches assigned a negative sentiment. There were also cases in which a negative sentiment was assigned due to the movie name itself, e.g., ”Let us watch Wrong Turn”. Hence, a modified sentiment analyzer was trained. Features included: 1. whether the sentence is a question or not, 2. the presence of words indicating intention, and of positive or negative words [35], 3. the average representation of sentiment and intention words given by the word2vec model [37], and 4. the scores of existing sentiment classifiers [33, 34]. Also, to avoid the problem of a keyword (movie or actor) altering the sentiment, the positive or negative keywords were removed. The performance comparison of the developed sentiment analyzer on the MovieForum data is provided in Table V.

Accuracy (%)
NLTK Sentiment Classifier [33]
Text Blob Classifier [34]
Proposed Approach
TABLE V: Results for sentiment analysis on the MovieForum dataset.

III-B5 Keyword-Sentiment Pairing

The last step was to attach sentiment to the extracted keywords. The direct approach was to pair the sentence sentiment with the corresponding keywords. However, in conversations, people can mention multiple movies in a sentence, with contrasting sentiment. Hence, we used a set of linguistic rules to improve the pairing, as described below.

  • The sentence was parsed using a constituency parser [30] and a set of rules was created to attach the sentiment to the keyword.

  • A total of 20 rules were created for comparative words like ”but”, ”and”, ”or”, ”yet”, ”although”, ”both … and”, ”instead”, ”as … as”, ”than”. E.g. the rule for ”but” was: In the constituency parse tree, if the parent of the ”but” conjunction is a noun phrase, attach the reverse sentiment of the part containing the verb phrase to the part which does not contain the verb phrase.

The final results for keyword-sentiment pairing are shown in Table VI.

Precision Recall
Direct Pairing
Modified Pairing 0.68 0.67
TABLE VI: Comparison of results for keyword-sentiment pairing.
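As a rough illustration of why a contrastive rule helps, the sketch below applies a parser-free approximation of the ”but” rule: keywords before ”but” receive the clause sentiment, and keywords after it receive the reversed sentiment. It does not use a constituency parse tree, so it is only a simplified stand-in for the rules described above; the function name and the single-token keyword assumption are hypothetical.

```python
def pair_keywords(tokens, keywords, clause_sentiment):
    """Pair each keyword with a sentiment, flipping the sign after 'but'.

    tokens:           lowercased word list of one sentence.
    keywords:         set of single-token keywords (assumed for simplicity).
    clause_sentiment: signed sentiment (+1 or -1) of the part before 'but'.
    """
    pairs = {}
    sign = 1
    for tok in tokens:
        if tok == "but":
            sign = -1  # contrast: reverse sentiment for what follows
        elif tok in keywords:
            pairs[tok] = sign * clause_sentiment
    return pairs
```

Direct pairing would instead assign the same sentence-level sentiment to every keyword, which is the failure case the rules are meant to avoid.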

III-C Inter-User Influence Modeling

In this section we describe the modeling of inter-user influence from conversation. We explain how the ratings of users vary due to agreement or conflict during the conversation. We create a user-user graph, based upon related work in social networks [16]. The algorithm assumes knowledge of the user names of the people present in the conversation, and takes the keyword-sentiment pairs extracted in Section III-B as the input.
1. Dependency Parsing: First, part-of-speech tags (POS tags) were detected using dependency parsing [29], and were then used to detect the subject of the conversation. The keywords, like movies and actors, and the corresponding sentiment were extracted as explained in Section III-B. The detected (subject, keyword, sentiment) tuples were output.
2. Keyword pruning: In conversation, there are cases in which references to a movie or star cannot be linked to another user. An example would be ”I want to watch The Prestige”. ”I-The Prestige” would be the user-keyword pair obtained from this, but if The Prestige had not been referred to before by any user, it would not convey agreement or disagreement with any other user. These keywords were pruned out.
3. Inter-User Sentiment: We now find the expressed sentiment for interaction between users. There are two possible cases here:

  • If the subject was not detected, the user who last used the particular keyword was taken as the referred user. The agreement or disagreement (i.e. the sentiment of interaction) is determined by whether their expressed sentiments matched or not.

  • If the keywords and subject were not detected from the sentence, then the following method was used. We assumed that people talking about what someone else has talked about, tend to bring up similar topics. Hence, we find the overlap of noun words between the sentences of the user as well as recent sentences spoken before. The user who last spoke the maximum overlapping sentence was assigned as the user referred. In case there was no overlap, the speaker of the sentence previous to the current one was taken to be the referred user.

5. User-User Influence Graph: The sentiment for each ordered user pair from the conversation was used to update the graph. The edge weight was assigned to be the extracted sentiment. Note that the graph is not symmetric: one user agreeing or disagreeing with another changes the weight of the edge from the first user to the second, but not the weight of the reverse edge. For multiple conversations, the sentiment for each one was added to the corresponding weight value.
6. User Rating Matrix update: The rating of a movie by a user was updated using the user-user influence graph.

The update used a regularization parameter, which was set to the number of users in our tests. The update brought the ratings of users in agreement closer together, so as to arrive at consensus more quickly. Negative-weight edges in the graph were not used in the update. However, the negative weights were maintained, so that users who were in prior disagreement must come to agreement before the corresponding edge weight is taken into account.
7. Limitations: Our subject analysis method may fail in case of complex movie names, for example ”Who Is Harry Kellerman and Why Is He Saying Those Terrible Things About Me?”. If this movie is part of a sentence, naturally ”He” will be detected as a subject, as well as a pronoun, and this will lead to a result that suggests the presence of an inter-user interaction, although there may not be.
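The graph bookkeeping and rating update in steps 5 and 6 can be sketched as below. The accumulation of signed sentiment on directed edges follows the description; the update rule itself is an assumed form (a pull toward agreed-with users, scaled by the edge weight and divided by the regularization parameter), since the original equation is not reproduced here. Class and function names are hypothetical.

```python
from collections import defaultdict


class InfluenceGraph:
    """Directed user-user graph with accumulated sentiment as edge weights."""

    def __init__(self):
        self.w = defaultdict(float)  # (speaker, referred user) -> weight

    def add_interaction(self, speaker, referred, sentiment):
        # Only the speaker -> referred edge changes; the graph is asymmetric.
        self.w[(speaker, referred)] += sentiment


def update_ratings(ratings, graph, lam):
    """Pull each user's ratings toward those of users they agreed with.

    ratings: {user: {movie: rating}}; lam: regularization parameter
    (the number of users in the paper's tests).
    Assumed rule: r_u += (1 / lam) * sum_v max(w_uv, 0) * (r_v - r_u).
    Negative edges are kept in the graph but ignored in the update.
    """
    new = {}
    for u, rated in ratings.items():
        new[u] = {}
        for movie, r in rated.items():
            pull = 0.0
            for v, other in ratings.items():
                w = graph.w.get((u, v), 0.0)
                if v != u and w > 0 and movie in other:
                    pull += w * (other[movie] - r)
            new[u][movie] = r + pull / lam
    return new
```

Because negative weights remain stored, a later agreement first has to cancel an earlier disagreement before the edge starts to influence the ratings.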

Fig. 4: User-user graph for modeling inter-user interactions.

III-C1 Results

The user influence modeling system showed good performance on the MovieForum dataset. For a set of about users in the database with agreements, disagreements or neutral exchanges between any two users, the algorithm had and .

Questions Strongly Disagree(1) Disagree(2) Neutral(3) Agree(4) Strongly Agree(5) p-value
Overall VoCoG provided good experience
Final recommendations were good
Updates to recommendations were appropriate
System took care of your preferences
Response time of the system was fast enough
System was non-intrusive
TABLE VII: Results of the conducted survey of VoCoG, the proposed collaborative viewing system on the Likert Scale of , with each cell denoting the fraction of responses. On all the components except the response time, more than participants showed agreement or strong agreement. p-value for Wilcoxon test with hypothesis of being greater than rating of is shown in the last table.
Fig. 5: Snapshots of the interface of a prototype of the proposed system.

III-D Group Consensus Function

Finally, to arrive at the recommendations for the whole group, a group consensus function was used. A variety of group consensus functions, like maximum pleasure, average satisfaction, least misery, etc., have been explored in the group recommendation literature [17, 18].
1. Average without misery function: We used the ”average without misery” function for our case. This function first eliminates movies on the basis of ”misery”, i.e., if any user has rated a movie below a threshold, the movie is eliminated. For the surviving movies, the average rating is computed for each movie, based on which the top movies are decided. In our experiments, the threshold for misery was decided empirically. The final decision was taken using a weighted average rating computation. The weights were determined by user behavior in the conversation, such as the number of sentences spoken and the number of users influenced.
2. Consensus Detection: The system decided whether the users had reached a consensus by comparing the top-rated movie with the lower-ranked ones. If the overall rating of the top movie for the group exceeded that of the next movie in the list by a specified factor (a fixed multiple in our experiment), the consensus was deemed to have been reached.
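A minimal sketch of the two steps above follows. The misery threshold, the per-user weights and the consensus ratio are parameters whose values are not specified here, so they are left as arguments; the function names and data layout are assumptions.

```python
def average_without_misery(ratings, weights, misery_threshold):
    """Rank movies by weighted average rating after removing 'misery' movies.

    ratings: {user: {movie: rating}} for a non-empty group.
    weights: {user: weight}, e.g. derived from conversation behavior.
    A movie is dropped if any user rates it below misery_threshold.
    Returns [(movie, score)] sorted best first.
    """
    users = list(ratings)
    common = set.intersection(*(set(ratings[u]) for u in users))
    total_w = sum(weights[u] for u in users)
    scored = []
    for movie in common:
        if any(ratings[u][movie] < misery_threshold for u in users):
            continue  # at least one user would be miserable
        score = sum(weights[u] * ratings[u][movie] for u in users) / total_w
        scored.append((movie, score))
    return sorted(scored, key=lambda ms: (-ms[1], ms[0]))


def consensus_reached(ranked, ratio):
    """Consensus if the top score exceeds `ratio` times the runner-up score."""
    if not ranked:
        return False
    if len(ranked) == 1:
        return True
    return ranked[0][1] > ratio * ranked[1][1]
```

With equal weights this reduces to the plain average-without-misery function; unequal weights let more active or more influential users count for more in the final decision.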

IV Working Prototype

Figure 5 shows snapshots of a working prototype of the system. As shown in the figure, the users are first asked to log in to the system. VoCoG waits until all the users have joined the session, to enable a synchronized experience. Once all the users have logged in, VoCoG generates initial recommendations based on the users’ previous histories, and outputs a voice message as well as text on the screen. In the prototype, the number of movie options shown was kept at so as to generate conversation about each option.

The users can then converse among themselves using ”Record” and ”Send” buttons. This is similar to the interfaces in many voice-based assistant systems. The interface also helps in sequencing the user conversation seamlessly. The sentences spoken by the users are sent to other users and also the back-end server in real time. The users are represented by avatars, and as they speak, the corresponding avatars light up.

After a fixed time interval, VoCoG refreshes the recommendations, and an ”Updating recommendations” message is played as well as shown on the screen. The next set of recommendations, based on processing the user conversation following the method described in Section III, is then displayed. The process continues until consensus is detected, as shown in the figure, where users have converged upon the movie ”The Dead Zone”. The movie is then played when all the users click the video icon. If consensus is not reached after five rounds of updates, the top-rated movie is shown as the final output.

V System Evaluation

For a comprehensive evaluation of the proposed system, VoCoG was made to interact with users. About people ( females) were involved in a survey to judge the performance of VoCoG. The participants in the survey were drawn from the age group of , and had varied movie watching preferences. The system was then measured on different parameters following the methods for user-centric recommender system evaluation [46, 47].

V-A Survey Design

The survey was conducted as follows:
Phase 1: In the first phase, the participants were asked to rate a set of popular movies from the MovieLens database. The movies to be rated were chosen to be representative of different genres. These ratings, combined with the movie dataset (Section III-A1), were used to train the model. Each subject rated movies on an average. This gave VoCoG the data to create an initial profile of each subject. The subjects were also asked whether they were frequent movie watchers (more than 2 times a week), and whether they are usually active in conversation.
Phase 2: In the second phase, these subjects were grouped in groups of , and groups were formed. These groups were then called upon to interact with the system. Thereafter, they rated the system on six parameters, as shown in Table VII, on a Likert scale of , where represents the worst rating and the best. The environment was arranged to be identical to the one that viewers would experience in a remote collaborative viewing implementation. All three viewers were seated in different rooms and could interact only through the system. VoCoG listened to their conversations and updated the recommendations periodically.
Questionnaire: After the interaction, the participants were asked to fill out a questionnaire. Here, they rated the different aspects of interaction with the system (shown in Table VII) on a Likert scale of (Strongly Disagree) - (Strongly Agree).

V-B Survey Analysis

Here, we analyze different aspects of interaction of participants with VoCoG.
1. Questionnaire Response: The summary of the questionnaire responses is shown in Table VII. As can be seen, VoCoG received a strongly positive response (more than participants agreed or strongly agreed) for all the parameters (recommendation quality, interactivity, non-intrusiveness) except response time. The lower rating for response time arises because VoCoG searches through a large movie dataset for recommendations. We hope to improve the system response time in future implementations.
2. Conversation analysis: Table VIII shows the statistics of average mentions of different entities in the survey. It can be seen that the participants conversed the most about movies, followed by genres and actors/stars. There were also a considerable number of agreements/disagreements between the participants while interacting. Overall they participated well in the survey, with the number of sentences spoken per update being around .
3. Group Recommendation response: Table VIII also shows the statistics for user responses to the recommendations provided by VoCoG. As the users provide feedback through conversation, different aspects of the response need to be captured than in click-based systems. As shown in Table VIII, about movie mentions out of the total average mentions per update cycle were regarding the recommended movies. Overall on an average out of movies were unique per update. This shows that while the users discussed the recommended movies, they also looked out for diverse recommendations. The statistics for genre term mentions ( out of on an average were from the recommended list) indicate that the users found it more convenient to express themselves in terms of genre choices. Actors and directors were mentioned only a few times. Also, VoCoG was able to reach consensus for only out of groups. This calls for better modeling of group consensus and a better understanding of user dynamics. We intend to study these as future directions of this work.
4. Variations due to user differences: We also studied how the nature of the participants, viz. frequent/non-frequent and active/non-active (as collected in Phase 1), affected their interaction with the system. As shown in Table IX, frequent and active participants rated VoCoG highly on overall experience, but there were some lower ratings by non-frequent and non-active participants.

Entities Avg. number per update
Sentences spoken
Movie mentions
Actors/Directors mentioned
Genre Terms
Unique movies recommended
Recommended Movies mentions
Recommended Genre mentions
Recommended Actors mentions
User Agreement/Disagreement
TABLE VIII: Analysis of average mentions of different entities in the interaction of participants with VoCoG, between consecutive recommendation updates.
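The per-update averages reported in Table VIII can be computed along the following lines. This is a hedged sketch only: the `UpdateCycle` structure, the function names, and the case-insensitive substring matching are assumptions made for illustration (the actual system relies on the conversation analyzer of Section III), and the sample sentences are invented, not survey data.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence


@dataclass
class UpdateCycle:
    sentences: List[str]    # what users said between two consecutive updates
    recommended: List[str]  # titles shown on screen during that interval


def count_mentions(sentences: Sequence[str], titles: Sequence[str]) -> int:
    """Count (sentence, title) pairs where the title occurs in the sentence."""
    return sum(
        title.lower() in sentence.lower()
        for sentence in sentences
        for title in titles
    )


def mention_stats(cycles: Sequence[UpdateCycle], catalog: Sequence[str]) -> Dict[str, float]:
    """Average per-update counts, in the spirit of the rows of Table VIII."""
    return {
        "sentences": mean(len(c.sentences) for c in cycles),
        "movie_mentions": mean(count_mentions(c.sentences, catalog) for c in cycles),
        "recommended_movie_mentions": mean(
            count_mentions(c.sentences, c.recommended) for c in cycles
        ),
    }


# Invented example data (not from the survey).
catalog = ["Alien", "Heat", "The Dead Zone"]
cycles = [
    UpdateCycle(
        sentences=["I loved Alien", "Heat is too long", "Maybe a comedy"],
        recommended=["Alien", "The Dead Zone"],
    ),
    UpdateCycle(
        sentences=["The Dead Zone then?", "Yes, The Dead Zone"],
        recommended=["The Dead Zone", "Heat"],
    ),
]
stats = mention_stats(cycles, catalog)
print(stats)
```

Comparing `recommended_movie_mentions` with `movie_mentions` gives the share of the discussion that concerned the recommended titles, which is the comparison the analysis above draws on.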
Participant SD D N A SA
Frequent, Active
Frequent, Non-active
Non-frequent, Active
Non-frequent, Non-active
TABLE IX: Ratings statistics for different participant groups (frequent/non-frequent movie watchers, active/non-active in conversation) on the overall experience with VoCoG being good.
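A breakdown such as Table IX amounts to cross-tabulating Likert responses by participant group. The sketch below shows one possible way to do this, assuming responses are available as (group, label) pairs; the function names and the sample responses are hypothetical placeholders and do not reproduce the survey results.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

LIKERT_LABELS = ["SD", "D", "N", "A", "SA"]  # column order used in Table IX


def tabulate_by_group(responses: Iterable[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Map each participant group to its counts in SD, D, N, A, SA order."""
    counts: Dict[str, Counter] = {}
    for group, label in responses:
        if label not in LIKERT_LABELS:
            raise ValueError(f"unknown Likert label: {label!r}")
        counts.setdefault(group, Counter())[label] += 1
    return {g: [c[label] for label in LIKERT_LABELS] for g, c in counts.items()}


def agree_share(row: List[int]) -> float:
    """Fraction of responses in a row that are Agree or Strongly Agree."""
    total = sum(row)
    return (row[3] + row[4]) / total if total else 0.0


# Placeholder responses for illustration only (not survey data).
responses = [
    ("Frequent, Active", "SA"),
    ("Frequent, Active", "A"),
    ("Non-frequent, Non-active", "N"),
    ("Non-frequent, Non-active", "A"),
]
table = tabulate_by_group(responses)
for group, row in table.items():
    print(group, row, agree_share(row))
```

The same `agree_share` helper applies to a whole questionnaire item, giving the "agreed or strongly agreed" proportion used when summarizing responses.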

VI Conclusion and Future Directions

In this paper, we have described the framework for VoCoG, an intelligent, non-intrusive interface for a collaborative group-viewing experience. We have described the technology behind each component of VoCoG, viz. an online recommendation system, a robust conversation analyzer and a user-user interaction modeling algorithm.

In the future, we plan to optimize the system for a faster response time. We also need to expand the scope of the algorithms to update user preferences beyond the session, so as to optimize the longer-term viewing experience, and to incorporate better features for modeling user dynamics and consensus. We further plan to incorporate a richer GUI, using avatars and augmented sound to improve the experience.


  • [1] J. Blascovich and J. Bailenson, Infinite reality: Avatars, eternal life, new worlds, and the dawn of the virtual revolution.   William Morrow & Co, 2011.
  • [2] P. Cesar and K. Chorianopoulos, “The evolution of tv systems, content, and users toward interactivity,” Foundations and Trends in Human-Computer Interaction, vol. 2, no. 4, pp. 373–95, Apr 2009.
  • [3] M. Nathan, C. Harrison, S. Yarosh, L. Terveen, L. Stead, and B. Amento, “Collaboratv: making television viewing social again,” in Proceedings of the 1st international conference on Designing interactive user experiences for TV and video.   ACM, 2008, pp. 85–94.
  • [4] J. Lull, “The social uses of television,” Human communication research, vol. 6, no. 3, pp. 197–209, 1980.
  • [5] J. G. Webster and J. J. Wakshlag, “The impact of group viewing on patterns of television program choice,” Journal of Broadcasting & Electronic Media, vol. 26, no. 1, pp. 445–455, 1982.
  • [6] N. Ducheneaut, R. J. Moore, L. Oehlberg, J. D. Thornton, and E. Nickell, “Social tv: Designing for distributed, sociable television viewing,” Intl. Journal of Human-Computer Interaction, vol. 24, no. 2, pp. 136–154, 2008.
  • [7] M. Ko, S. Choi, J. Lee, U. Lee, and A. Segev, “Understanding mass interactions in online sports viewing: Chatting motives and usage patterns,” ACM Trans. Comput.-Hum. Interact., vol. 23, no. 1, pp. 6:1–6:27, Jan. 2016.
  • [8] M. McGill, J. H. Williamson, and S. Brewster, “Examining the role of smart tvs and vr hmds in synchronous at-a-distance media consumption,” ACM Trans. Comput.-Hum. Interact., vol. 23, no. 5, pp. 33:1–33:57, Nov. 2016.
  • [9] J. Steuer, “Defining virtual reality: Dimensions determining telepresence,” Journal of communication, vol. 42, no. 4, pp. 73–93, 1992.
  • [10] M. V. Sanchez-Vives and M. Slater, “From presence to consciousness through virtual reality,” Nature Reviews Neuroscience, vol. 6, no. 4, pp. 332–339, 2005.
  • [11] K. Christakopoulou, F. Radlinski, and K. Hofmann, “Towards conversational recommender systems,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16.   New York, NY, USA: ACM, 2016, pp. 815–824.
  • [12] L. Chen and P. Pu, “Critiquing-based recommenders: survey and emerging trends,” User Modeling and User-Adapted Interaction, vol. 22, no. 1, pp. 125–150, 2012.
  • [13] G. Bresler, G. H. Chen, and D. Shah, “A latent source model for online collaborative filtering,” in Advances in Neural Information Processing Systems, 2014, pp. 3347–3355.
  • [14] X. Zhao, W. Zhang, and J. Wang, “Interactive collaborative filtering,” in Proceedings of the 22nd ACM international conference on Conference on information & knowledge management.   ACM, 2013, pp. 1411–1420.
  • [15] J. Kawale, H. H. Bui, B. Kveton, L. Tran-Thanh, and S. Chawla, “Efficient thompson sampling for online matrix-factorization recommendation,” in Advances in Neural Information Processing Systems, 2015, pp. 1297–1305.
  • [16] W. Sherchan, S. Nepal, and C. Paris, “A survey of trust in social networks,” ACM Computing Surveys (CSUR), vol. 45, no. 4, p. 47, 2013.
  • [17] J. Masthoff, “Group recommender systems: Combining individual models,” in Recommender systems handbook.   Springer, 2011, pp. 677–702.
  • [18] K. McCarthy, M. Salamó, L. Coyle, L. McGinty, B. Smyth, and P. Nixon, “Group recommender systems: a critiquing based approach,” in Proceedings of the 11th international conference on Intelligent user interfaces.   ACM, 2006, pp. 267–269.
  • [19] C. Carlsson and O. Hagsand, “Dive a multi-user virtual reality system,” in Virtual Reality Annual International Symposium, 1993., 1993 IEEE.   IEEE, 1993, pp. 394–400.
  • [20] P. Curtis and D. A. Nichols, “Muds grow up: Social virtual reality in the real world,” in Compcon Spring’94, Digest of Papers.   IEEE, 1994, pp. 193–200.
  • [21] T. Igarashi and J. F. Hughes, “Voice as sound: using non-verbal voice input for interactive control,” in Proceedings of the 14th annual ACM symposium on User interface software and technology.   ACM, 2001, pp. 155–156.
  • [22] P. R. Cohen and S. L. Oviatt, “The role of voice input for human-machine communication,” Proceedings of the National Academy of Sciences, vol. 92, no. 22, pp. 9921–9927, 1995.
  • [23] M. C. Salzman, C. Dede, R. B. Loftin, and J. Chen, “A model for understanding how virtual reality aids complex conceptual learning,” Presence: Teleoperators and Virtual Environments, vol. 8, no. 3, pp. 293–316, 1999.
  • [24] V. Becker, “Interactive television experience in convergent environment: Models, reception and business,” in Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video.   ACM, 2016, pp. 119–122.
  • [25] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions,” IEEE transactions on knowledge and data engineering, vol. 17, no. 6, pp. 734–749, 2005.
  • [26] S. B. Roy, S. Thirumuruganathan, S. Amer-Yahia, G. Das, and C. Yu, “Exploiting group recommendation functions for flexible preferences,” in 2014 IEEE 30th International Conference on Data Engineering.   IEEE, 2014, pp. 412–423.
  • [27] H. Wu, Y. Wang, and X. Cheng, “Incremental probabilistic latent semantic analysis for automatic question recommendation,” in Proceedings of the 2008 ACM Conference on Recommender Systems, ser. RecSys ’08.   New York, NY, USA: ACM, 2008, pp. 99–106.
  • [28] D. H. Stern, R. Herbrich, and T. Graepel, “Matchbox: large scale online bayesian recommendations,” in Proceedings of the 18th international conference on World wide web.   ACM, 2009, pp. 111–120.
  • [29] D. Chen and C. Manning, “A Fast and Accurate Dependency Parser using Neural Networks,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).   Association for Computational Linguistics, Oct. 2014, pp. 740–750.
  • [30] M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu, “Fast and accurate shift-reduce constituent parsing.” in ACL (1), 2013, pp. 434–443.
  • [31] L. A. Ramshaw and M. P. Marcus, “Text chunking using transformation-based learning,” arXiv preprint cmp-lg/9505040, 1995.
  • [32] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, “The stanford corenlp natural language processing toolkit.” in ACL (System Demonstrations), 2014, pp. 55–60.
  • [33] S. Bird, “Nltk: the natural language toolkit,” in Proceedings of the COLING/ACL on Interactive presentation sessions.   Association for Computational Linguistics, 2006, pp. 69–72.
  • [34] S. Loria, “,” 2013.
  • [35] B. Liu, M. Hu, and J. Cheng, “Opinion observer: analyzing and comparing opinions on the web,” in Proceedings of the 14th International conference on World Wide Web.   ACM, 2005, pp. 342–351.
  • [36] Q. Liu, Z. Gao, B. Liu, and Y. Zhang, “Automated rule selection for aspect extraction in opinion mining,” in Proceedings of the 24th International Conference on Artificial Intelligence, ser. IJCAI’15, 2015, pp. 1291–1297.
  • [37] T. Mikolov and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in Neural Information Processing Systems, 2013.
  • [38] A. Pesarin, M. Cristani, V. Murino, and A. Vinciarelli, “Conversation analysis at work: detection of conflict in competitive discussions through semi-automatic turn-organization analysis,” Cognitive processing, vol. 13, no. 2, pp. 533–540, 2012.
  • [39] O. Vinyals and G. Friedland, “Towards semantic analysis of conversations: A system for the live identification of speakers in meetings,” in Semantic Computing, 2008 IEEE International Conference on.   IEEE, 2008, pp. 426–431.
  • [40] N. Jovanović et al., “Towards automatic addressee identification in multi-party dialogues.”   Association for Computational Linguistics, 2004.
  • [41] D. Wyatt, T. Choudhury, J. A. Bilmes, and H. A. Kautz, “A privacy-sensitive approach to modeling multi-person conversations.” in IJCAI, vol. 7, 2007, pp. 1769–1775.
  • [42] F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, pp. 19:1–19:19, Dec. 2015.
  • [43] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine learning, vol. 42, no. 1-2, pp. 177–196, 2001.
  • [44] B. Mehta, T. Hofmann, and W. Nejdl, “Robust collaborative filtering,” in Proceedings of the 2007 ACM conference on Recommender systems.   ACM, 2007, pp. 49–56.
  • [45] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, ““your word is my command”: Google search by voice: A case study,” in Advances in Speech Recognition.   Springer, 2010, pp. 61–90.
  • [46] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell, “Explaining the user experience of recommender systems,” User Modeling and User-Adapted Interaction, vol. 22, no. 4-5, pp. 441–504, Oct. 2012.
  • [47] P. Pu, L. Chen, and R. Hu, “A user-centric evaluation framework for recommender systems,” in Proceedings of the Fifth ACM Conference on Recommender Systems, ser. RecSys ’11.   New York, NY, USA: ACM, 2011, pp. 157–164.