Personalized News Recommendation with Context Trees

03/04/2013 ∙ by Florent Garcin, et al.

The profusion of online news articles makes it difficult to find interesting articles, a problem that can be assuaged by using a recommender system to bring the most relevant news stories to readers. However, news recommendation is challenging because the most relevant articles are often new content seen by few users. In addition, news articles are subject to trends and preference changes over time, and in many cases we do not have sufficient information to profile the reader. In this paper, we introduce a class of news recommendation systems based on context trees. They can provide high-quality news recommendations to anonymous visitors based on their present browsing behaviour. We show that context-tree recommender systems provide good prediction accuracy and recommendation novelty, and that they are sufficiently flexible to capture the unique properties of news articles.




1 Introduction

With a growing number of online news stories, it has become difficult to find interesting articles. Recommender systems alleviate this issue by bringing the most relevant items to users. However, while such systems have been used with considerable success for products such as books and movies, they have found surprisingly little application in recommending news articles.

There are many challenges in finding the most relevant, high-quality news stories to recommend to readers. Firstly, articles must be recommended soon after they are written, leaving little time to collect data about their popularity. A second issue is that there is often little data available about a user’s past behaviour: many news sites are reached through search engines, so visitors cannot be identified. Finally, recommendations for news stories depend on a number of factors: the popularity of a news item, the freshness of the story, the topic, and the sequence of news items or topics that the user has seen so far.

Current approaches [11, 17] apply techniques designed for product recommendation to the domain of news articles. However, in doing so they ignore the intrinsic properties of news stories. To overcome the lack of data about the users, they frequently rely on the history of logged-in users, which raises potential privacy issues. We believe this is why many newspaper websites continue to recommend news articles using a simple most-popular approach.

Our contribution is a class of online recommendation algorithms based on Context-Tree (CT) models. The online nature of the algorithms means that the model provides recommendations and is updated simultaneously and fully incrementally. Context trees are a versatile class of Bayesian statistical models. A CT model defines a partition tree in some space, where each subset of each partition is called a context. Each context is associated with a different local prediction model, called an expert; the expert predictions are then combined to make predictions. It is important to select both the partition structure and the expert model appropriately.

In this work, we consider different spaces to partition: a) sequences of news items, b) sequences of topics, and c) topic distributions. The first two constructions estimate distributions of variable-order Markov models, and so concentrate on modelling the temporal characteristics of users’ behaviour. The last model instead makes predictions conditional on the distribution of topics preferred by a user. Each of these constructions results in a different behavioural model.

By assigning an expert to each context, the CT model defines a tree distribution on expert models. Each expert model gives predictions for a subset of the data space, i.e. the context the expert is responsible for. The expert predictions are combined to make recommendations. We tailor our expert models to the idiosyncrasies of news. More specifically, our expert models take into account the popularity and freshness of news items.

In all cases, the CT distribution admits a closed-form, incremental Bayesian inference procedure. Hence, in contrast to other methods, our approach can easily be employed online to simultaneously generate recommendations and update the model.

The questions we want to answer are whether a) a sophisticated expert model can improve recommendation quality, b) the temporal sequence is important for recommendation, c) the content helps in making good recommendations, and d) CT models give novel recommendations.

To answer these questions, we examine a scenario where users visit a website anonymously and only information about the current visit can be used to make recommendations. We obtained access logs from two newspaper websites: Tribune de Genève and 24 Heures (the most popular newspapers in the cantons of Geneva and Vaud, respectively). As recommendations are difficult to evaluate with real users, we measure how well our recommendations match the news items that readers selected themselves.

We show that context-tree recommender systems have a robust performance, surpassing that of a standard approach over a wide range of parameters. In addition, we performed an independent unbiased test where we show that CT methods achieve good performance for both accuracy of prediction and novelty of recommendations.

The remainder of this paper begins with a brief review of the literature. In Section 3, we introduce the general idea of context-tree recommender systems, and define recommender systems that build a context tree on the sequence of items (Sec. 3.1.1), the sequence of topics (Sec. 3.1.2) and a hybrid of the two (Sec. 3.1.3). We also show how to take advantage of the topic distributions (Sec. 3.2). Section 3.3 characterizes the different expert models tailored to the domain of news. We present and discuss our results in Section 4. Finally, we conclude in Section 5.

2 Related Work

In this work, we use recommender systems to suggest relevant and interesting news articles to readers. In general, there are two classes of recommender systems [3]: collaborative filtering [31], which recommends items based on the preferences of similar users, and content-based systems [22], which use the content similarity of the items.

The earliest example where collaborative filtering is used for news recommendation is the GroupLens project, which applied it to newsgroups [26, 18]. News aggregation systems such as Google News [11] also implement such algorithms. In their work, they use Probabilistic Latent Semantic Indexing and MinHash for clustering news items, and item covisitation for recommendation (i.e. two news items clicked by the same user within a given time frame). Their system builds a graph in which the nodes are the news stories and the edges represent the number of covisitations. Each of the approaches generates a score for a given news item, and the scores are aggregated into a single score via a linear combination.

Content-based recommendation is more common for news personalisation [8, 4, 17, 2]. NewsWeeder [19] is probably the first content-based approach for recommendations, although applied to newsgroups. NewsDude [8] and, more recently, YourNews [4] implemented content-based systems. They both use Term Frequency-Inverse Document Frequency (TF-IDF) and the cosine similarity between TF-IDF vectors to generate the recommendations. NewsDude has a model for long-term interests and another for short-term interests. The long-term model represents news stories as Boolean feature vectors, where each feature indicates the presence or absence of pre-selected words; a naïve Bayesian classifier is used with these vectors. Short-term interests are captured by converting news stories into TF-IDF vectors and applying the k-nearest-neighbours algorithm with cosine similarity.

It is also possible to combine the two types in a hybrid system [10, 21, 20]. For example, Liu et al. [21] extend the Google News study by looking at user click behaviour in order to create accurate user profiles. They propose a Bayesian model to recommend news based on the user’s interests and the news trend of a group of users, and combine this approach with the one by Das et al. [11] to generate personalized recommendations. Li et al. [20] introduce an algorithm based on a contextual bandit which learns to recommend by selecting news stories to serve users based on contextual information about the users and stories. At the same time, the algorithm adapts its selection strategy based on user-click feedback to maximize the total number of user clicks.

Most of these works rely on the history of logged-in users, which raises potential privacy issues. Our work departs from this restriction and considers only a one-time session for recommendation, in which users do not log in. These works also discard the strong sequential component of reading news stories, which can be modelled as a Markov process. Classic recommender-system approaches such as collaborative filtering require recomputing the model periodically. In this work, we propose an incremental algorithm that updates the model continuously with little additional computation [13], and is thus better suited to such a dynamic domain.

We focus on a class of recommender systems based on context trees. Usually, these trees are used to estimate Variable-order Markov Models (VMM). VMMs were originally applied to lossless data compression, in which a long sequence of symbols is represented as a set of contexts, and statistics about symbols are combined into a predictive model [27]. VMMs have many other applications [5].

Closely related, variable-order hidden Markov models [32], hidden Markov models [23] and Markov models [24, 28, 12] have been extensively studied for the related problem of click prediction. These models suffer from high state complexity. Although techniques [34] exist to decrease this complexity, the main drawback is that multiple models have to be maintained, making these approaches neither scalable nor suitable for online learning.

Few works [35, 30, 25] apply such Markov models to recommender systems. Zimdars et al. [35] describe a sequential model with a fixed history. Predictions are made by learning a forest of decision trees, one for each item; when the number of items is large, this approach does not scale. Our approach requires only one tree: the context tree. Shani et al. [30] consider a finite mixture of Markov models with fixed weights. They need to maintain a reward function in order to solve a Markov decision process for generating recommendations. As future work, they suggest the use of a context-specific mixture of weights to improve prediction accuracy; in this work, we follow such an approach. Rendle et al. [25] combine matrix factorization and a Markov chain model for basket recommendation. The idea of factoring Markov chains is interesting and could be complementary to our approach. However, their limitation is that they consider only a first-order Markov chain; a higher order is not tractable because the states are baskets, which contain many items.

Due to the singular properties of news, it is not possible to apply these methods directly and achieve good recommendations; instead, we need tailor-made models. Surprisingly, we do not know of any existing research that applies context-tree models to news recommendation.

3 Context-Tree Recommender

There are two key ideas behind a Context-Tree (CT) recommender system. Firstly, it cuts the data space into a set of successively refined partitions, called a partition tree. Each subset in every partition is called a context. The contexts are arranged in a tree structure such that, for any context in some partition, there is always exactly one context in the previous (coarser) partition that completely contains it. The resulting tree has one node per context. A context can be the set of sequences ending in a given suffix, or a set of probability distributions. In this work, we focus on sets of sequences of news items and topics, as well as sets of topic distributions.

The second key idea is to assign a local prediction model to each context, called an expert. Each expert gives predictions only for a subset of the space. For instance, a particular expert gives predictions only for users who have read a particular sequence of stories, or users who have read an article that was sufficiently close to a particular topic distribution.

Recommending news articles depends on multiple parameters: the popularity of the news item, the freshness of the story, and the sequence of news items or topics that the user has seen so far. We define an expert model for each of these properties and show how to combine them.

For the sequence of items, we introduce three variations of the CT recommender system: a) the standard Variable-order Markov Model (VMM) system models the context as an ordered sequence of news items and the experts predict the next news item, b) the Content-based VMM (CVMM) system considers ordered sequences of topics and the experts predict the next topic, and c) the Hybrid VMM (HVMM) recommender builds a context tree of ordered sequences of topics, but the experts predict the next news item.

The CVMM and HVMM approaches look at the sequence of the single best-matching topic for each item. However, the expressiveness of each context is limited, because contexts represent sequences of individual topics. We present a context-tree recommender system, called k-d CT, which builds a tree on partitions of the k-dimensional space of topic distributions.

For all these models, each context is associated with an expert who predicts the next item. A weight is assigned to each expert, expressing how confident the expert is in its prediction. Recommendations are made by selecting a path in the context tree (i.e. a set of contexts) and combining the weighted predictions of its experts. In the VMM approach, for example, we select the path matching the current sequence of read news items. The predictions are propagated along the path from the most general context down to the most specific context, and the weights of the corresponding experts are updated at the same time.

We now detail the CT model and inference procedure for the VMM system. The remaining systems use an equivalent procedure.

3.1 VMM-based Recommender Systems

Because of the sequential nature of news reading, it is intuitive to model news browsing as a Markov process [30]. Readers are in different states at a given time, and recommendations are generated by looking at the transition probability from one state to another. The user’s state can be summarised by the last k items visited. We refer to such k-long sequences of items as contexts. A context corresponds to one news item in the case of a first-order Markov model, or to k news items in the case of a k-order Markov model. Larger values of k lead to contexts that are more informative, but also scarcer.

Variable-order Markov Models extend Markov models so that the context length is not fixed but varies [5]. This flexibility allows the model to use a larger order only in cases where doing so results in better predictions. As a result, it has the advantage of performing well when learning on short sequences and on low-quality datasets.

In the following sections, we first consider a context-tree in the space of sequences of news articles. Then, using the most probable topic of each news item, we look at sequences of read topics. Finally, we combine the two approaches into a hybrid version.

3.1.1 Standard Recommender

A variable-order Markov model recommender builds a context tree representing the sequences of news items. More formally, let N be the set of news items and S the set of sequences, or visits, made by anonymous users. We consider a sequence of read news items s = ⟨x_1, …, x_t⟩, with x_i ∈ N. A context ξ is a suffix of s, and we write ξ ≺ s when the last |ξ| elements of s are equal to ξ.

The context tree T is a tree with nodes and edges such that each node i corresponds to a unique context ξ_i, and the i-th node’s parent corresponds to the context obtained by removing the oldest item of ξ_i. Specifically, the root node corresponds to the empty context ⟨⟩, and a child node at depth d has a context of length d of the form ξ = x ∘ ξ′, where ∘ is the concatenation operator; for example, ⟨a⟩ ∘ ⟨b, c⟩ = ⟨a, b, c⟩.
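To illustrate, the suffix-based context tree and its active path can be sketched as follows (a minimal sketch with our own names, not code from the paper): each node holds one expert’s statistics, and children index one item further into the past.

```python
# Hypothetical sketch (names are ours): the context tree stored as nested
# dictionaries, grown online as new suffixes are observed.
class Node:
    def __init__(self):
        self.children = {}   # item -> Node, one step further into the past
        self.counts = {}     # next-item click counts for this node's expert

def active_path(root, sequence, grow=True):
    """Return the experts whose contexts are suffixes of `sequence`,
    from the root (empty context) down to the longest matching suffix.
    With grow=True, missing nodes are created, expanding the tree online."""
    path, node = [root], root
    # Walk the sequence backwards: suffixes of length 1, 2, ...
    for item in reversed(sequence):
        if item not in node.children:
            if not grow:
                break
            node.children[item] = Node()
        node = node.children[item]
        path.append(node)
    return path
```

For the sequence ⟨a, b, c⟩, the active path contains the experts for the contexts ⟨⟩, ⟨c⟩, ⟨b, c⟩ and ⟨a, b, c⟩.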

Each context ξ_i is associated with an expert μ_i, who predicts the next news item. For a specific sequence s of news items, a subset of experts A(s) is active, such that ξ_i ≺ s for every μ_i ∈ A(s). An expert μ_i has a probability distribution over the news items, and we write P_{μ_i}(x_{t+1} = x | s) for the posterior probability of the next news item being x given the sequence s, for the expert μ_i.

Figure 1: Context tree for an example sequence s. Nodes in dashed red are the active experts A(s).

We associate a weight w_i to each expert μ_i, and define a Bayesian Variable-order Markov Model (BVMM, [13]). Standard approaches use the Context Tree Weighting algorithm [33], which is defined for binary prediction. We use the generalised version, BVMM, which incrementally updates the weights with a Bayesian rule. The probability of the next news item being x is defined as a mixture of the probabilities of all active experts, computed recursively along the path of active experts:

  q_1(x) = P_{μ_1}(x_{t+1} = x | s),
  q_{i+1}(x) = w_{i+1} P_{μ_{i+1}}(x_{t+1} = x | s) + (1 − w_{i+1}) q_i(x).      (1)

We can interpret the weights as the confidence in the prediction made by an expert for a news item given the current sequence. For a given sequence s, the corresponding set of active experts A(s) = {μ_1, …, μ_m} forms a path in the context tree, starting from the root expert μ_1 down to the leaf expert μ_m. We define the root expert to have a weight of 1: w_1 = 1, so that q_1 = P_{μ_1}. We update the weights of the active experts in A(s) sequentially as follows:

  w_i ← w_i P_{μ_i}(x_{t+1} | s) / q_i(x_{t+1}),

where q_i is the combined prediction of the first i experts. Note that the path in the tree and the corresponding set of active experts change for each different sequence s. Therefore, the expert μ_i is not the same across sequences. The parameters of non-active experts remain the same.

For instance, Figure 1 shows the context tree for an example sequence s and its active experts A(s). q_1 is the prediction of the root expert for the next item, and q_m is the complete prediction of the model. The parameters of each expert are then updated in the manner described in Section 3.3.
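The recursive mixture and the Bayesian weight update described above can be sketched as follows (a minimal rendering with our own function names; probs[i] is expert μ_i’s probability for the item in question, ordered from root to leaf, and the root weight is fixed to 1):

```python
def combine(weights, probs):
    """Fold expert predictions from the root (weight 1) down to the most
    specific expert: q_i = w_i * p_i + (1 - w_i) * q_{i-1}."""
    q = probs[0]
    for w, p in zip(weights[1:], probs[1:]):
        q = w * p + (1 - w) * q
    return q

def update_weights(weights, probs):
    """Bayesian update after observing an item: w_i <- w_i * p_i / q_i,
    where q_i is the combined prediction of the first i experts.
    Returns the new weights and the overall prediction q_m."""
    q = probs[0]
    new_weights = [weights[0]]          # root weight stays 1
    for w, p in zip(weights[1:], probs[1:]):
        q_new = w * p + (1 - w) * q
        new_weights.append(w * p / q_new)
        q = q_new
    return new_weights, q
```

An expert that assigned the observed item a higher probability than the mixture of the more general experts sees its weight increase; one that did worse sees it decrease.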

The VMM recommender builds a context tree based on the sequence of news items. A news story is about some topics, hence it is possible to model the behaviour of a reader as a sequence of topics instead of news items. The following two sections illustrate this idea and introduce two variations of the VMM recommender system: the content-based VMM and hybrid VMM recommender systems.

3.1.2 Content-based Recommender

Since the number of news items is very large, a better approach might be to recommend stories whose content is similar to the ones a user previously read. This can be done by representing the news stories as vectors of features: the system evaluates the similarity between vectors, and recommends the items with the most similar feature vectors. With this, even an item that has not been read by anyone can be recommended.

Most approaches to feature representation use TF-IDF: a set of keywords or terms is chosen, and for each news item the frequency of occurrence of each keyword in the item is combined with the inverse frequency of occurrence of that keyword among all news stories. This reduces a news story to a vector of TF-IDF scores, one per keyword. The major drawback is that it does not capture the internal structure of news stories, or the relations between them, very well.

Instead, we use a probabilistic topic model to learn the content. In particular, we choose Latent Dirichlet Allocation (LDA) over other methods such as Probabilistic Latent Semantic Indexing [16], because the latter suffers from overfitting in practice [9].

The idea of LDA is that a journalist writes an article with particular topics in mind, and draws words with a certain probability from a bag of words for each topic. A news story is then represented as a mixture of various topics, and the goal is to find this mixture of topics for each news item. We write θ_n for the probability distribution over topics in a particular news item n, and φ_z for the probability distribution over words for a given topic z. P(z_i = z) denotes the probability that topic z is assigned to the i-th word w_i in news item n, and P(w_i | z_i = z) the probability of w_i within topic z. It follows that

  P(w_i | n) = Σ_{z=1}^{Z} P(w_i | z_i = z) P(z_i = z),

where Z is the number of latent topics, which we need to specify in advance. The multinomial distributions φ_z over words and θ_n over topics indicate which words are important for which topic and which topics are significant for a certain item. φ and θ have Dirichlet priors with hyperparameters β and α, respectively. To estimate φ and θ, we use the Gibbs sampling technique [14], in which a set of samples from the posterior distribution over topic assignments is sufficient for the estimation.

To apply LDA, we concatenate the title, summary and content of each news item, then tokenize the words and remove stopwords. After that, we apply LDA to all the news stories in the dataset, and obtain a topic distribution vector for each news item. Note that the topics are neither classified nor named, so they need not carry any human-interpretable meaning.

We can now redefine the context tree of Section 3.1.1. Using the most probable topic of each news item, we consider sequences of read topics s_z = ⟨z_1, …, z_t⟩. The context is a suffix of a sequence of read topics. The remainder of the model stays the same as in the VMM recommender system; the only difference is that we have topics instead of news items. Hence we cannot recommend news items directly, but we generate recommendations by combining the probability of the next topic with the probability of that topic for each news item. The score of a news item n is given by

  score(n) = Σ_z P(z_{t+1} = z | s_z) · P(z | n).
The system evaluates each candidate news story and recommends the news items with the highest scores. We name this system the Content-based VMM recommender system (CVMM RecSys).
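A sketch of this scoring step (hypothetical names: `next_topic_prob` would come from the topic-level context tree, and `item_topics` from the LDA topic distributions of the candidate items):

```python
def cvmm_recommend(next_topic_prob, item_topics, r=5):
    """Score each candidate item by summing, over topics, the predicted
    probability of the next topic times the item's weight on that topic,
    then return the r highest-scoring items."""
    scores = {
        item: sum(next_topic_prob.get(z, 0.0) * p for z, p in topics.items())
        for item, topics in item_topics.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:r]
```

For example, if the predicted next topic is mostly "sports", an item whose LDA mixture is dominated by "sports" outranks a purely "politics" item.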

3.1.3 Hybrid Recommender

We combine the standard VMM with the content-based VMM recommender system into a hybrid version. The context tree is built on topics, similarly to the CVMM system, but the experts make predictions about news items, like the VMM system.

The Hybrid VMM recommender system (HVMM RecSys) builds a tree in the space of topic sequences. The context is a suffix of a sequence of most probable topics ⟨z_1, …, z_t⟩. The sets of fresh and popular items contain news stories, not topics, and all probabilities (Equation 1 and the related equations) are defined with respect to news items.

The tree structure is very limited because the context is constrained to a sequence of individual topics. In the next section, we lift this restriction to fully exploit the topic distribution.

3.2 k-d Context-Tree Recommender

For a given news story, we obtain a topic distribution. The CVMM and HVMM structures seen before use only the most probable topic to construct the sequence. However, the complete topic distribution of the last news item may be more informative than the temporal sequence of most probable topics. For this reason, we use a k-d tree to build a context model in the space of topic distributions.

A k-d tree is a binary tree that partitions a k-dimensional space into smaller subspaces [7]. A node corresponds to a hyperplane splitting the space of topic distributions into two half-spaces. We associate each node with one of the k dimensions, and its corresponding hyperplane is perpendicular to that dimension’s axis. A leaf stores at least one topic distribution. For instance, a node associated with dimension j splits the space at a value v: a topic distribution whose value for dimension j is smaller than v lies in the left subtree, and one with a larger value lies in the right subtree.

There are various ways to construct a k-d tree, depending on the chosen partitioning strategy. A simple idea is to select the axis based on the depth, so that we cycle through all possible axes: axis = depth mod k.

In order to use this structure as a context tree, we assign a context to each node as before; however, the context now represents a subset of the possible topic distributions. Every time the system observes a new topic distribution, the distribution is added to the k-d tree, and possibly the tree expands. We refer to this method as the k-d Context-Tree recommender system (k-d CT RecSys).
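This online growth can be sketched as follows (a hypothetical sketch with our own names; the path from the root to the reached leaf plays the role of the set of active contexts):

```python
# Hypothetical sketch: nodes split the space of topic distributions on one
# dimension at a time, cycling through the k axes by depth.
class KDNode:
    def __init__(self, point, depth, k):
        self.point = point        # a stored topic distribution
        self.axis = depth % k     # splitting dimension cycles with depth
        self.left = None
        self.right = None

def kd_path(root, point, k):
    """Return the nodes visited while locating `point`: the active contexts
    from the whole space down to the most specific cell. A new leaf is
    added when the point reaches an empty side, so the tree grows online."""
    path, node, depth = [root], root, 0
    while True:
        side = "left" if point[node.axis] < node.point[node.axis] else "right"
        child = getattr(node, side)
        if child is None:
            child = KDNode(point, depth + 1, k)
            setattr(node, side, child)
            path.append(child)
            return path
        node, depth = child, depth + 1
        path.append(node)
```

Each revisit of a similar topic distribution follows (and possibly deepens) the same path, so more data yields more specific contexts in dense regions of the topic space.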

3.3 Expert Model

Recommending news articles depends on multiple factors: the popularity of the news item, the freshness of the story, and the sequence of news items or topics that the user has seen so far. We define a model for each of these properties, and show how to combine them. The first model ignores the temporal dynamics of the process, the second accounts for the possibility that users may be mainly looking at popular items, and the last assumes that users are mostly interested in fresh items (i.e. breaking news). Each context corresponds to an expert μ, which calculates the posterior probability of the next item given any sequence in its corresponding context.

All three models are constructed through Dirichlet priors. A prior mass is assigned to all possible outcomes, such that the prior probability that the outcome x is in some set B is proportional to the mass of B. More formally, if m(B) is the mass of B, then P(x ∈ B) ∝ m(B). The prior is updated via counting: whenever a new item x is read, the mass of all sets containing x increases by 1, so that m′(B) = m(B) + I{x ∈ B}, where I{x ∈ B} equals 1 if x ∈ B, and 0 otherwise.

3.3.1 Standard

A naïve approach to estimating the multinomial probability distribution over the news items is to use a Dirichlet distribution on the multinomial parameters of each expert μ. The probability of reading a particular news item then depends only on the number of times it has been read when the expert is active:

  P_μ(x_{t+1} = x | s) = (c_μ(x) + α_0) / (Σ_{x′ ∈ N} c_μ(x′) + α_0 |N|),      (6)

where c_μ(x) is the number of times x has been read while μ was active, and α_0 is the initial count of the Dirichlet distribution.
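As a rough illustration, this Dirichlet-smoothed count estimate can be written as follows (a sketch with our own names; `alpha` stands for the initial count):

```python
def dirichlet_prob(counts, item, n_items, alpha=1.0):
    """Posterior predictive of a Dirichlet-multinomial expert:
    (count of item + alpha) / (total count + alpha * number of items).
    With no observations, this falls back to the uniform distribution."""
    total = sum(counts.values())
    return (counts.get(item, 0) + alpha) / (total + alpha * n_items)
```

With no clicks recorded, every item gets probability 1/|N|; as clicks accumulate, the estimate converges to the empirical click frequencies.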

The dynamics of news items are more complex: a news item provides new content and has therefore been seen by few users, and news is subject to trends and frequent preference changes. We improve this simple model by augmenting it with models for popular and fresh news items.

3.3.2 Popularity

A news item x is in the set Pop of popular items when it has been read at least once among the last ρ read news items. We compute the probability of a news item x given that it is popular as:

  P(x_{t+1} = x | x ∈ Pop) = C(x) / Σ_{x′ ∈ Pop} C(x′),

where C(x) is the total number of clicks received for news item x. Note that C(x) is not equal to c_μ(x) (Eq. 6): c_μ(x) is the number of clicks for news item x when the expert μ is active, while C(x) is the number of clicks received by x in total, whether the expert is active or not.

The number ρ of popular items is important because it is unique to each news website. When ρ is small, the expert considers only the most recently read news. It is possible to tune this parameter to achieve better performance.

3.3.3 Freshness

A news item x is in the set Fresh of fresh items when it has not been read by anyone yet but is among the next φ news items to be published on the website, i.e. breaking news. Since fresh items have no click history, we compute the probability of news item x given that it is fresh as uniform over the fresh set:

  P(x_{t+1} = x | x ∈ Fresh) = 1 / |Fresh|.

The number φ of fresh items influences the prediction made by this expert, and it is also unique to each news website.

3.3.4 Mixing the expert models

We combine the three expert models using the following mixture:

  P_μ(x_{t+1} = x | s) = P(x | x ∈ Pop) P(x_{t+1} ∈ Pop) + P(x | x ∈ Fresh) P(x_{t+1} ∈ Fresh) + P_0(x | s) P(x_{t+1} ∉ Pop ∪ Fresh),

where P_0 is the standard model of Eq. 6. There are two ways to compute the mixing probabilities P(x_{t+1} ∈ Pop), P(x_{t+1} ∈ Fresh) and P(x_{t+1} ∉ Pop ∪ Fresh): either by using a Dirichlet prior that ignores the expert prediction, or by a Bayesian update that calculates the posterior probability of each expert model according to its accuracy.

For the first approach, the probability of the next news item being popular is:

  P(x_{t+1} ∈ Pop) = (n_pop + α_0) / (n_μ + 3 α_0),

where n_μ represents the number of times the expert has been active, and n_pop the number of read news items that were popular when the expert was active.

Similarly, the probability of the next news item being fresh is given by:

  P(x_{t+1} ∈ Fresh) = (n_fresh + α_0) / (n_μ + 3 α_0),

where n_fresh is the number of read news items that were fresh when the expert was active.

Finally, the probability of the next news item being neither popular nor fresh is:

  P(x_{t+1} ∉ Pop ∪ Fresh) = (n_μ − n_pop − n_fresh + α_0) / (n_μ + 3 α_0).

Note that P(x_{t+1} ∈ Pop) + P(x_{t+1} ∈ Fresh) + P(x_{t+1} ∉ Pop ∪ Fresh) = 1.

It might happen that, using the Dirichlet priors, predictions are dominated by a single expert model. To overcome this issue, we compute the probabilities P(x_{t+1} ∈ Pop), P(x_{t+1} ∈ Fresh) and P(x_{t+1} ∉ Pop ∪ Fresh) via a Bayesian update, which adapts them based on the performance of each expert model: after observing the item actually read, each model’s probability is multiplied by the likelihood that model assigned to the item, and the probabilities are renormalised.
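A minimal sketch of this Bayesian mixing update (our own names; each entry of `model_probs` is the probability one of the three expert models assigned to the item actually read):

```python
def mixture_prob(weights, model_probs):
    """Mixture over the standard, popularity and freshness models:
    p(x) = sum over models m of P(m) * p_m(x)."""
    return sum(w * p for w, p in zip(weights, model_probs))

def bayesian_weight_update(weights, model_probs):
    """Scale each model's weight by the likelihood it assigned to the item
    actually read, then renormalise so the weights sum to 1."""
    total = mixture_prob(weights, model_probs)
    return [w * p / total for w, p in zip(weights, model_probs)]
```

A model that consistently assigns higher probability to the items users actually read accumulates weight, so the mixture adapts to whichever of the three behaviours dominates on a given website.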

procedure Learn(x_{t+1}, context set A(s))
    m ← |A(s)|    // number of active experts
    // loop from the most general expert μ_1 to the most specific expert μ_m
    for i = 1, …, m do
        if BayesianUpdate then
            update P(x ∈ Pop), P(x ∈ Fresh), P(x ∉ Pop ∪ Fresh) (Sec. 3.3.4)
        update expert μ_i and its weight w_i according to x_{t+1}
    // if the context is not in the tree, then add a new leaf node
    if the most specific context does not cover the full sequence then
        add a new leaf node
end procedure

procedure Recommend(s)
    for all candidate news items x do
        // loop from the most general expert μ_1 to the most specific expert μ_m
        for i = 1, …, m do
            compute q_i(x) from q_{i−1}(x) and P_{μ_i}(x_{t+1} = x | s) (Eq. 1)
    sort all candidates by q_m(x) in descending order
    return the first r elements
end procedure

Algorithm 1 VMM recommender system

Algorithm 1 presents a sketch of the context-tree recommender algorithm. For simplicity, we split the recommender system into two procedures: learn and recommend. However, our implementation combines the two into a complete online algorithm which makes recommendations while learning, and thus does not need offline computation. The system estimates the probability of each candidate and recommends the news items with the highest probability. When the recommender system needs to estimate the probability of a candidate item x, the system 1) selects the active experts, which correspond to a path in the context tree from the most general context to the most specific context, and 2) propagates the combined prediction q_i from the root down to the leaf, i.e. the most specific context. The value q_m at the leaf expert is the estimated probability of the candidate item x for the recommender system (see Eq. 1).

4 Evaluation and Comparison

In this evaluation, we are interested in whether the class of CT recommender systems has an advantage over standard methods and, if so, what the best combination of partition and expert model is under a typical usage scenario.

More specifically, we answer the following questions:

  1. What role do the different experts play? What is the best way to update the weights of their mixture?

  2. Is the temporal sequence important when recommending news articles?

  3. Does the content of the news stories help to make good recommendations?

  4. Do CT recommender systems make novel recommendations?

    Novelty is essential because it exposes the reader to relevant news items that she would not have seen by herself. Obvious but accurate recommendations of most-popular items are of little use.

We evaluate our systems on two datasets. We use the first dataset to examine the sensitivity of the CT models to hyperparameters and compare them to existing techniques. For the second part of the experiments, we perform an unbiased comparison between the different CT models, whereby we first select a particular evaluation criterion, then we select the optimal hyperparameters for that criterion on the first dataset, and then measure the performance on the second dataset. This methodology [6] mirrors the approach that would be followed by a practitioner who wants to implement a recommender system on a newspaper website.

4.1 Datasets

We collected data from the websites of two daily Swiss-French newspapers, Tribune de Genève (TDG) and 24 Heures (24H). TDG and 24H are the most popular newspapers in the cantons of Geneva and Vaud, respectively (in 2011, they had a readership of 138'000 and 223'000, and a circulation of 51'487 and 75'796). Their websites contain news stories ranging from local news, national and international events, and sports to culture and entertainment.

The datasets span from Nov. 2008 until May 2009. They contain all the news stories displayed, and all the visits by anonymous users within the time period. Note that a new visit is created every time a user browses the website, even if she browsed the website before. The raw data has a lot of noise due to, for instance, crawling bots from search engines or browsing on mobile devices with unreliable internet connections. Table 4.1 shows the dataset statistics after filtering out this noise, and Figure 2 illustrates the distribution of visit length for each dataset.

Table 4.1: Datasets after filtering.

          News stories   Visits    Clicks
  TDG     10'400         600'256   1'069'131
  24H     8'613          249'099   509'978

Figure 2: Distribution of the length of visits.

4.2 Evaluation Metrics

There has been much discussion on the best way to evaluate recommender systems [15, 29]. The ideal approach would be to deploy them on an actual site and measure the click-through rate on recommended items. Unfortunately, this is usually far too costly, and evaluation has to be carried out on behaviour that was observed without the recommender being present. In our case, we have visit histories from the newspaper websites, and we can evaluate how well our recommendations match the news items that readers selected themselves. This is clearly a somewhat inaccurate measure: a) the user may not have liked all the items she visited; and b) the user may have preferred one of the recommended items to the one she clicked, so the fact that a recommended item was not visited does not mean the recommendation is bad. Nevertheless, we believe that predicting the visit history is still a useful way to compare the performance of different techniques, and so we use it here.

We evaluate how good the systems are at predicting the future news a user is going to read. Specifically, we consider sequences of news items x_1, x_2, ..., x_n read by anonymous users. The sequences, and the news items in each sequence, are sorted by increasing visit time. When an anonymous user starts to read a news item x_t, the system generates recommendations. As soon as the user reads another news item x_{t+1}, the system updates its model with the observed transition from x_t to x_{t+1}, and generates a new set of recommendations. Hence the training set and the testing set are split based on the current time: at time t, the training set contains all news items accessed before t, and the testing set has items accessed after t.
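This replay protocol can be sketched as a loop over time-ordered visits. The `Model` class below is a deliberately simplified stand-in (a first-order successor-count model, not one of the CT recommenders) used only to make the train-on-past, test-on-future split concrete; all names are illustrative.

```python
class Model:
    """Toy most-frequent-successor model, only to illustrate the protocol."""
    def __init__(self):
        self.counts = {}  # prev_item -> {next_item: count}

    def update(self, prev_item, next_item):
        succ = self.counts.setdefault(prev_item, {})
        succ[next_item] = succ.get(next_item, 0) + 1

    def recommend(self, item, k=5):
        succ = self.counts.get(item, {})
        return sorted(succ, key=succ.get, reverse=True)[:k]

def evaluate(visits, k=5):
    """Replay visits in time order: recommend, observe the next click,
    then update the model with that single observation. Only data seen
    before the current time step is ever used for a prediction."""
    model, hits, total = Model(), 0, 0
    for visit in visits:                       # visits sorted by start time
        for prev, nxt in zip(visit, visit[1:]):
            recs = model.recommend(prev, k)
            if recs:                           # skip cold-start steps
                total += 1
                hits += nxt in recs
            model.update(prev, nxt)
    return hits / total if total else 0.0
```

For example, replaying two visits that share a common prefix scores the second visit with a model trained only on the first.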

For a given sequence and the current news item x_t in this sequence, we define the successor set as the set of news items x_{t'} with t' > t, and R as the set of recommended news items. We say that a recommended news item is relevant if it is in the successor set. We always recommend 5 news stories, and we use two metrics to evaluate how good the recommendations are: success at 5 (s@5) and Mean Average Precision (MAP).

s@5 is equal to 1 if the immediate successor of the current item is among the first 5 recommended news stories, and 0 otherwise.

It is also interesting to consider the order in which the recommended news items are presented by the system. Since the recommendation set is actually an ordered list of recommended news, we can compute the precision at every position k in the ranked sequence of news stories:

P(k) = (number of relevant items among the top k recommendations) / k

The average precision over the ranks of the relevant news stories is then calculated as:

AP = ( sum_{k=1}^{N} P(k) · rel(k) ) / ( sum_{k=1}^{N} rel(k) )

where N is the number of recommendations, and rel(k) equals 1 if the news item at rank k is relevant, 0 otherwise.

Finally, the Mean Average Precision is the mean of the average precision over a given set of queries Q, i.e. over each recommendation set the system generates:

MAP = (1 / |Q|) sum_{q in Q} AP(q)

s@5 captures how good the recommended news stories are with respect to the immediate successor, whereas MAP looks at all future news stories and at how the recommended news items are ordered.
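The two metrics can be sketched directly from these definitions. The helper names are illustrative; normalising AP by the number of relevant items retrieved follows the standard convention and is our assumption.

```python
def success_at_k(recs, next_item, k=5):
    """s@k: 1 if the immediate successor appears in the top-k list."""
    return 1.0 if next_item in recs[:k] else 0.0

def average_precision(recs, successors):
    """AP over one ranked recommendation list; `successors` is the set
    of items the user actually read later in the visit."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recs, start=1):
        if item in successors:
            hits += 1
            score += hits / rank      # precision at this relevant rank
    return score / hits if hits else 0.0

def mean_average_precision(queries):
    """MAP: mean of AP over (recommendation list, successor set) pairs."""
    aps = [average_precision(r, s) for r, s in queries]
    return sum(aps) / len(aps) if aps else 0.0
```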

We briefly recall the systems we evaluate:

VMM RecSys

is the standard VMM recommender system in which browsing behaviours are modelled as an ordered sequence of news items (Sec. 3.1.1).

CVMM RecSys

is a pure content-based approach where each news story is labelled with its most probable topic (Sec. 3.1.2).


HVMM RecSys

is a hybrid VMM recommender system which brings together the structure of the content-based system with the prediction of the standard VMM method (Sec. 3.1.3).

-CT RecSys

is a variation of the hybrid VMM recommender system, but with a different context-tree structure using the entire topic distribution (Sec. 3.2).

Figure 3: VMM recommender system: different mixtures of experts (Bayesian update).
(a) VMM and fixed-order Markov chain
(b) context-tree recommender systems
Figure 4: Accuracy for personalized news items (std + pop + fresh).

4.3 Results

For all systems, we use a fixed prior for the Dirichlet models, and initialise the weight of each expert as a function of the depth d of the node the expert is assigned to. For the topic-based solutions, we evaluated experimentally the optimal number of topics in the range from 30 to 500. Increasing the number of topics did not significantly raise the performance, so we set it to 50, a reasonable compromise between accuracy and complexity. We varied the number of popular items from 10 to 500; when this number is small, the experts consider only the most recently read news stories as candidates. We also varied the number of fresh items from 10 to 100, the mixture of experts (standard, popularity and/or freshness), and whether the probabilities are computed via a Bayesian update or not. We report averages over all recommendations with confidence intervals at 95%. Due to space constraints, we omit figures for the TDG dataset, but we observed the same behaviours.

  1. What role do the different experts play? What is the best way to update the weights of their mixture?

    The mixture of expert models plays an important role in the performance. Bayesian update for the weights is more robust.

For instance, in Figure 3, mixtures integrating the popularity model are very sensitive to the number of popular items, while others are more robust. We see that there is an optimal number of popular items for which a recommender system gives the best accuracy, but also that the strategy of always recommending the most popular items does not pay off as the number of popular items increases: "good" recommendations are drowned among popular items. Although naïve, this approach of recommending the most popular stories is actually used very often on newspaper websites.

We noticed that, when using the Dirichlet priors to update the mixture probabilities, the prediction was mostly made by the popularity model, resulting in the same behaviour as the most-popular recommender system as the number of popular items increases. However, because the Bayesian update (Eq. 3.3.4) adapts the probabilities based on the performance of each expert model, it is more robust as the number of popular items increases. We also observed that as the number of fresh items increases, the CT models get slightly better on both metrics.
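The robustness of the Bayesian update can be seen from a minimal sketch of it: each expert's weight is scaled by the probability that expert assigned to the item the user actually clicked, then the weights are renormalised, so experts that keep mispredicting lose influence. The function name and interface are illustrative, not the paper's implementation.

```python
def bayesian_update(weights, expert_probs):
    """weights: prior mixture weights, one per expert.
    expert_probs: probability each expert assigned to the observed item.
    Returns the posterior mixture weights."""
    posterior = [w * p for w, p in zip(weights, expert_probs)]
    z = sum(posterior)
    if z == 0:            # no expert predicted the observed item
        return weights    # keep the prior unchanged
    return [w / z for w in posterior]
```

For example, starting from uniform weights over two experts, the expert that assigned probability 0.8 to the clicked item ends up with weight 0.8 after one update.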

Figure 5: Weights distribution of the VMM over the depth of the context tree (Bayesian update).
(a) VMM
(b) -CT
Figure 6: Accuracy and novelty for context-tree recommender systems (std + pop + fresh).
  2. Is the temporal sequence important when recommending news articles?

    Yes, the temporal sequence increases the accuracy.

Figure 4(a) shows that the VMM recommender system performs better than fixed-order Markov chain recommenders such as the ones by Zimdars et al. [35]. In Figure 4, we consider only personalized items: for each approach, we removed from the recommendation set the most popular news stories, i.e. those recommended by the most-popular approach, so that the reduced set contains only personalized recommendations.

In addition, the weights of the experts for the VMM recommender system are well distributed over the depth of the context tree, even for long sequences (Fig. 5). If the sequence were not important, the weights of the experts at depths greater than 1 would have converged to 0.

  3. Does the content of the news stories help to make good recommendations?

    No, the content of news stories does not help.

For instance, in Figure 4(b) we focus only on novel recommendations, since the most popular items are taken out. Pure content-based approaches such as the CVMM and -CT systems perform particularly poorly. CVMM also has the worst performance in the comparison of CT models presented in the next section. We do not observe that content enables both novel and accurate recommendations, which contradicts common wisdom in the research community.

  4. Do CT recommender systems make novel recommendations?

    Yes, but there is a trade-off between accuracy and novelty.

We define the novelty as the ratio of recommended items that the user has not yet seen over all recommended items. In general, CT recommenders generate novel recommendations (see Fig. 6). However, CVMM does not provide any novel items. The -CT recommender generates many novel items, and seems to offer the best trade-off between accuracy and novelty. However, we are not sure about the accuracy of the novel items (see Fig. 4(b)): as discussed in Section 4.2, we only observe browsing traces, and readers may simply not have found the novel items by themselves. Hence the -CT recommender could still be a very good option in practice, as it generates many novel results.
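The novelty ratio follows directly from this definition; the helper below is a hypothetical illustration, where `seen` is the set of items already read in the user's visit history.

```python
def novelty(recommended, seen):
    """Fraction of recommended items the user has not yet seen:
    |recommended \ seen| / |recommended|."""
    recommended = list(recommended)
    if not recommended:
        return 0.0
    unseen = [item for item in recommended if item not in seen]
    return len(unseen) / len(recommended)
```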

(a) Tuning: 24H dataset
(b) Testing: TDG dataset with optimal parameters from 24H dataset.
Figure 7: Expected performance curves: accuracy and novelty trade-off for context-tree recommenders.

4.4 Comparison

The best way to implement a recommender system depends on the system designer's goals, or the user's preferences. For news recommendation in particular, we face a trade-off between novelty and accuracy. We formalise the preferences of the designer with respect to this trade-off via the following utility function:

u(λ, D, θ) = λ · accuracy(D, θ) + (1 − λ) · novelty(D, θ)

where λ specifies how the trade-off between accuracy and novelty is made, D is the dataset (24H or TDG), and θ is an assignment of parameters. For CT systems, the parameters are the number of popular and fresh items, whether the probabilities are computed via a Bayesian update or not, and the mixture of experts (standard, popularity and/or freshness).

We can now simulate the process of a designer who tunes the recommender system on a small dataset (24H) before deploying it online (on the TDG dataset). For any given value of the trade-off parameter, we find the best parameters on the 24H dataset, and then measure the performance on the other dataset. This gives the expected performance curve [6], which provides an unbiased evaluation of each system's performance.
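The tuning-then-testing procedure can be sketched as follows, assuming the linear utility above and a precomputed table of offline scores; `scores` maps (dataset, params) pairs to (accuracy, novelty) and all names are illustrative.

```python
def utility(acc, nov, lam):
    """Linear accuracy/novelty trade-off: lam in [0, 1]."""
    return lam * acc + (1 - lam) * nov

def expected_performance_curve(scores, params, tune, test, lambdas):
    """For each trade-off value, pick the parameters that maximise
    utility on the tuning dataset, then report the utility those same
    parameters achieve on the held-out test dataset."""
    curve = []
    for lam in lambdas:
        best = max(params, key=lambda p: utility(*scores[(tune, p)], lam))
        acc, nov = scores[(test, best)]
        curve.append((lam, utility(acc, nov, lam)))
    return curve
```

A gap between the tuning and test curves would indicate overfitting of the hyperparameters; the flat gap reported above is what makes the comparison unbiased.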

Figure 7 illustrates the expected performance curves for both the 24H and TDG datasets. The curve in Figure 7(a) shows the optimal utility on the tuning dataset (24H). Figure 7(b) shows the corresponding utility achieved on the test dataset (TDG) using the parameters found on the tuning dataset. It is clear that all methods are robust enough that their curves are essentially the same for both datasets, even though the parameters are tuned only on the smaller one. No matter how the trade-off is selected, the pure content-based method is worse than all of the other approaches. This shows that the topic alone does not identify which article an individual will read.

5 Conclusion

Because of the abundance of news on the web, news recommendation is an important problem, but also challenging due to the natural properties of a news item: when a news story is very recent, there is little data available to generate recommendations. Moreover, it is subject to trends and preference changes over time.

In this paper, we introduced a class of recommender systems based on context trees that bring relevant and interesting news articles to readers. Classic recommender system approaches such as collaborative filtering require recomputing the model from scratch whenever new data arrives. In this work, we proposed an incremental algorithm that updates the model continuously, and is thus better suited to such a dynamic domain.

We considered different context trees in the space of sequences of news, sequences of topics, and in the space of topic distributions. More specifically, we presented the VMM recommender in which browsing behaviours are modelled as an ordered sequence of news items; the CVMM recommender where each news story is labelled with the most probable topic; the HVMM which brings together the structure of CVMM system with the prediction of the standard VMM method; and finally the -CT recommender using the entire topic distribution.

In the context of news recommendations, we defined different expert models which consider the popularity and freshness of news items, and examined ways to combine them into a single model.

In conclusion, we showed that CT recommender systems are flexible enough to capture the properties of news items, and perform better than existing techniques.

We demonstrated that a) a sophisticated expert model can improve recommendation quality, b) the temporal sequence is important for recommendation, c) the content does not help in making good recommendations, because the topic alone is not enough for personalized recommendations (we think this is because users do not like to read multiple stories about the same topic), and d) CT models achieve a good trade-off between novelty and accuracy. Individual behaviour plays an important role, and finding a good model that characterizes this individual behaviour is an open research question. We believe that news websites should consider these techniques for keeping readers interested in their sites.

For future work, we would like to examine different expert models. For example, we could additionally consider the time a reader spends on a given news article or topic. Finally, in order to accurately evaluate the performance of systems that recommend many novel items, we intend to conduct an online user study.


  • Abel et al. [2011] F. Abel, Q. Gao, G. Houben, and K. Tao. 2011. Analyzing User Modeling on Twitter for Personalized News Recommendations. User Mod., Adaption and Perso. (2011), 1–12.
  • Adomavicius and Tuzhilin [2005] G. Adomavicius and A. Tuzhilin. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. Trans. on KDE 17 (2005), 734–749.
  • Ahn et al. [2007] J. Ahn, P. Brusilovsky, J. Grady, and D. He. 2007. Open user profiles for adaptive news systems: help or harm? In Proc. of Int. Conf. on World Wide Web. 11–20.
  • Begleiter et al. [2004] R. Begleiter, R. El-Yaniv, and G. Yona. 2004. On prediction using variable order Markov models. J. Artif. Int. Res. (2004), 385–421.
  • Bengio et al. [2005] S. Bengio, J. Mariethoz, and M. Keller. 2005. The Expected Performance Curve. In Proc. of 22nd International Conference on Machine Learning. 9–16.
  • Bentley [1975] J. Bentley. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM (1975), 509–517.
  • Billsus and Pazzani [1999] D. Billsus and M. Pazzani. 1999. A hybrid user model for news story classification. In Proc. of Conf. on User Modeling. 99–108.
  • Blei et al. [2003] D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet Allocation. J. of MLR 3 (2003), 993–1022.
  • Burke [2002] R. Burke. 2002. Hybrid Recommender Systems: Survey and Experiments. User Modeling and User-Adapted Interaction 12 (November 2002), 331–370. Issue 4.
  • Das et al. [2007] A. Das, M. Datar, A. Garg, and S. Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proc. of Int. Conf. on World Wide Web. 271–280.
  • Deshpande and Karypis [2004] M. Deshpande and G. Karypis. 2004. Selective Markov models for predicting Web page accesses. ACM Trans. Internet Technol. 4, 2 (2004), 163–184.
  • Dimitrakakis [2010] C. Dimitrakakis. 2010. Bayesian Variable Order Markov Models. In Proc. of AIStat. 161–168.
  • Griffiths and Steyvers [2002] T. Griffiths and M. Steyvers. 2002. A probabilistic approach to semantic representation. In Proc. of the 24th Conf. of the Cognitive Science Society.
  • Herlocker et al. [2004] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22 (2004), 5–53. Issue 1.
  • Hofmann [1999] T. Hofmann. 1999. Probabilistic Latent Semantic Indexing. In Proc. of the 22nd Int. Conf. on Research and Development in Information Retrieval. 50–57.
  • IJntema et al. [2010] W. IJntema, F. Goossen, F. Frasincar, and F. Hogenboom. 2010. Ontology-based news recommendation. In Proc. of Workshop on Data Semantics. 1.
  • Konstan et al. [1997] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl. 1997. GroupLens: applying collaborative filtering to Usenet news. Commun. ACM 40 (March 1997), 77–87. Issue 3.
  • Lang [1995] K. Lang. 1995. Newsweeder: Learning to filter netnews. In Proc. of the 12th Int. Conf. on Machine Learning. 331–339.
  • Li et al. [2010] L. Li, W. Chu, J. Langford, and R. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proc. of Int. Conf. on World Wide Web. 661.
  • Liu et al. [2010] J. Liu, P. Dolan, and E. Pedersen. 2010. Personalized news recommendation based on click behavior. In Proc. of the Conf. on Intelligent User Interfaces. 31–40.
  • Lops et al. [2011] P. Lops, M. Gemmis, and G. Semeraro. 2011. Content-based Recommender Systems: State of the Art and Trends. In Recommender Systems Handbook. Springer, 73–105.
  • Montgomery et al. [2004] A. Montgomery, S. Li, K. Srinivasan, and J. Liechty. 2004. Modeling Online Browsing and Path Analysis Using Clickstream Data. Marketing Science 23, 4 (2004), 579–595.
  • Pitkow and Pirolli [1999] J. Pitkow and P. Pirolli. 1999. Mining longest repeating subsequences to predict world wide web surfing. In Proc. of the conf. on USENIX Symp. on Int. Tech. and Sys. 13.
  • Rendle et al. [2010] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. 2010. Factorizing personalized Markov chains for next-basket recommendation. In Proc. of WWW. 811–820.
  • Resnick et al. [1994] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. 1994. GroupLens: an open architecture for collaborative filtering of netnews. In Proc. of the Conf. on Computer Supported Cooperative Work. 175–186.
  • Rissanen [1983] J. Rissanen. 1983. A Universal Data Compression System. Trans. Info. Th. (1983), 656–664.
  • Sarukkai [2000] R. Sarukkai. 2000. Link prediction and path analysis using Markov chains. Computer and Telecommunications Networking 33, 1-6 (2000), 377–386.
  • Shani and Gunawardana [2011] G. Shani and A. Gunawardana. 2011. Evaluating Recommendation Systems. In Recommender Systems Handbook. Springer, 257–297.
  • Shani et al. [2005] G. Shani, D. Heckerman, and R. Brafman. 2005. An MDP-Based Recommender System. Journal of Machine Learning Research 6 (December 2005), 1265–1295.
  • Su and Khoshgoftaar [2009] X. Su and T. Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Adv. in Artif. Intell. (January 2009), 4:2–4:2.
  • Wang et al. [2006] Y. Wang, L. Zhou, J. Feng, J. Wang, and Z. Liu. 2006. Mining Complex Time-Series Data by Learning Markovian Models. In Proc. of ICDM. 1136–1140.
  • Willems et al. [1995] F. Willems, Y. Shtarkov, and T. Tjalkens. 1995. The Context Tree Weighting Method: Basic Properties. IEEE Transactions on Information Theory 41 (1995), 653–664.
  • Zaki et al. [2010] M. Zaki, C. Carothers, and B. Szymanski. 2010. VOGUE: A variable order hidden Markov model with duration based on frequent sequence mining. Trans. KDD 4 (2010), 1–31.
  • Zimdars et al. [2001] A. Zimdars, D. Chickering, and C. Meek. 2001. Using temporal data for making recommendations. In Proc. of conf. on Uncertainty in art. intel. 580–588.