Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

05/04/2020 ∙ by Mariflor Vega-Carrasco, et al. ∙ UCL

Understanding the shopping motivations behind market baskets has high commercial value in the grocery retail industry. Analyzing shopping transactions demands techniques that can cope with the volume and dimensionality of grocery transactional data while keeping interpretable outcomes. Latent Dirichlet Allocation (LDA) provides a suitable framework to process grocery transactions and to discover a broad representation of customers' shopping motivations. However, summarizing the posterior distribution of an LDA model is challenging, while individual LDA draws may not be coherent and cannot capture topic uncertainty. Moreover, the evaluation of LDA models is dominated by model-fit measures which may not adequately capture qualitative aspects such as the interpretability and stability of topics. In this paper, we introduce a clustering methodology that post-processes posterior LDA draws to summarise the entire posterior distribution and identify semantic modes represented as recurrent topics. Our approach is an alternative to standard label-switching techniques and provides a single posterior summary set of topics, as well as associated measures of uncertainty. Furthermore, we establish a more holistic definition for model evaluation, which assesses topic models based not only on their likelihood but also on their coherence, distinctiveness and stability. By means of a survey, we set thresholds for the interpretation of topic coherence and topic similarity in the domain of grocery retail data. We demonstrate that the selection of recurrent topics through our clustering methodology not only improves model likelihood but also improves on LDA in qualitative aspects such as interpretability and stability. We illustrate our methods on an example from a large UK supermarket chain.




1 Introduction

In the grocery retail industry, millions of transactions are generated every day, and thousands of products are available to customers who want to satisfy different shopping needs. Analyzing human consumption, identifying shopping patterns and developing customer knowledge are tasks that demand methods that can cope with large amounts of high-dimensional data. Moreover, it is important that models provide clear and detailed insights into transactional data. Topic modeling is a natural, scalable statistical framework that can process millions of combinations of products within transactions while maintaining the explanatory power to discover, analyze and understand customer behaviors.

Topic modeling (TM) was originally introduced to uncover the hidden semantic structure in large collections of text corpora (blei2003latent). TM provides methods for automatically organizing, understanding, searching, and summarizing large collections of discrete data. Latent Dirichlet Allocation (LDA) (blei2003latent) is one of the most popular topic modeling techniques. LDA represents each document as a mixture of topics, where each topic is a multinomial distribution over a fixed vocabulary. The generative process of LDA postulates that each document is produced by sampling a distribution over topics and subsequently words within each topic. This process is repeated for every word in the document. LDA disregards the word order and the document sequence assuming that documents are bags of words and exchangeable. In the context of transactional retail data, documents are replaced by transactions and words by products. Thus, transactions are summarised by topical mixtures and topics are distributions over available products across the corpus. This assumption of an unordered bag of products is natural in the grocery retail domain in which products are registered at stores without an inherent order.

Topic model evaluation is typically based on model fit metrics such as held-out-likelihood or perplexity (wallach2009evaluation; buntine2009estimating), which assess the generalization capability of the model by computing the model likelihood on unseen data. However, the LDA likelihood does not capture qualitative aspects such as semantic coherence, and hence these metrics may lead to topic models with less semantically meaningful topics according to human annotators (chang2009reading). Topics may not correspond to genuine and meaningful themes (alsumait2009topic), may exhibit collections of highly frequent words (chemudugunta2007modeling), may be idiosyncratic word combinations that do not reappear consistently among LDA samples (steyvers2007probabilistic), and may exhibit significant variations among models or posterior samples (chuang2015topiccheck). mimno2011optimizing found that topic models may contain up to 10% of nonsensical topics that reduce users’ confidence in the model. Hence, the evaluation of topic models should not be exclusively based on likelihood metrics, but also include topic quality metrics.

Topic model quality is largely understood as topic coherence (lau2014machine; aletras2013evaluating; mimno2011optimizing). newman2010automatic introduced the evaluation of topic coherence of a single LDA posterior draw to measure the difficulty of associating an individual topic with a single semantic concept, and consequently, evaluating topic models by their interpretability. Topic coherence is typically quantified by co-occurrence metrics such as Pointwise Mutual Information (PMI) and Normalized Pointwise Mutual Information (NPMI) (bouma2009normalized), which have been shown to correlate with human annotators in newman2010automatic; lau2014machine.

However, the topic coherence of a single posterior LDA sample does not capture any aspects of posterior variability across samples. A topic associated with a particular semantic concept may appear and disappear across multiple samples, depending on its posterior uncertainty. Also, topics within the same posterior realization may contain product combinations that could be associated with the same semantic concept. It is, therefore, crucial to characterize not only a single posterior LDA sample but the entire posterior distribution.

In response, in this paper we propose a more holistic definition of topic quality which comprises topic coherence, topic distinctiveness and topic stability. Topic distinctiveness measures semantic dissimilarity among topics. Topic stability quantifies topic reappearance among posterior samples, i.e., showing low uncertainty. Thus, topic models of high quality should identify topics that are not only coherent, but also distinctive within, and recurrent among, posterior draws.

We develop a post-processing methodology that aggregates and fuses multiple posterior samples of LDA to capture a single summary of semantic modes within the posterior distribution. Our approach is an alternative interpretation to the label-switching problem (stephens2000dealing; jasra2005markov; sperrin2010probabilistic). Rather than assigning one-to-one matches of topics across posterior realizations, we use hierarchical clustering to identify recurrent topics by allocating the same label to two topics (either across or within a posterior sample) if they fulfill a theme-distance criterion which can be tuned to the context of interest; here we use cosine distance among distributional measures as it correlates with human judgment on topic similarity (aletras2014measuring). A clustered topic is then defined as the average word distribution among the group of topics from different LDA samples that exhibit the same theme, and its recurrence is measured by the number of topics within the cluster. Guided by the domain of interest, users can set topic stability thresholds, represented by the number of topics within a clustered topic, to select clustered topics of high recurrence, representing topic modes of low uncertainty. We demonstrate that selecting topic clusters of high recurrence can result in a posterior topic model summary which augments model generalization and topical quality aspects.

We present a customized user study in which experts in the analysis of market baskets assessed topics for their coherence and similarity. We use this study to relate our measures of coherence and similarity to users’ intuitive perception of these concepts. We interpret LDA topics in the application to grocery retail data and show that inferred topics of an LDA sample may not be the most coherent, distinctive and stable. In contrast, we empirically observe that clustered topics of high recurrence tend to be more coherent, more distinct, and clearly more stable than single LDA topics. In comparison to standard LDA models, we show that our methodology achieves similar likelihood and distinctiveness while significantly improving topic coherence and topic stability when applied to grocery retail data.

This paper is organized as follows: we discuss related work in section 2. LDA is described in section 3. Section 4 presents the definitions of model generalization, topic coherence, topic distinctiveness, and topic stability. Section 5 introduces a methodology for clustering and selecting recurrent topics. Sections 6, 7, and 8 show the application to grocery retail data from a major retailer in the UK. More specifically, section 6 discusses thresholds for topic coherence and topic similarity obtained from a user study with experts in the grocery retail industry and exhibits the pitfalls of LDA topics. Section 7 demonstrates the advantages of selecting clustered topics of high recurrence. Section 8 displays identified grocery topics and discusses their implications in the grocery retail sector. Finally, we conclude and summarise our findings in section 9.

2 Related work

Topic models have already been applied to retail data in the literature. jacobs2016model applied the LDA model to retail data from a medium-sized online retailer in the Netherlands to identify shopping motivations and to recommend products that are most likely to be purchased. hruschka2014linking analyzed product categories from a small data set of market baskets from a medium-sized German supermarket. In their study, LDA is used to identify combinations of product categories that are ultimately used in a recommender system. A more recent study by the same author (hruschka2016hidden) compared LDA with methods such as binary factor analysis (BFA), restricted Boltzmann machine (RBM), and deep belief net (DBN). Although these methods outperform LDA in model generalization, their outcomes are far less interpretable. All the aforementioned papers compared models by likelihood performance without taking into account qualitative aspects such as interpretability or stability of the models’ outcome.

The evaluation of topic models is typically carried out by computing the model’s performance on a secondary task, such as document classification or information retrieval, or by estimating the perplexity of unseen documents (blei2003latent). Several algorithms that accurately estimate the predictive likelihood of unseen documents are proposed in wallach2009evaluation; buntine2009estimating. Selection methods based on held-out-likelihood are useful for evaluating the predictive power of the model but may infer less semantically meaningful topics (chang2009reading) and may not yield better accuracy in text classification (wang2014multi).

Previous works have already highlighted some of the flaws of topic modeling and have proposed methodological changes to improve topic coherence. wallach2009rethinking proposed the use of asymmetric priors over document distributions to capture highly frequent terms in their own topics. newman2011improving introduced two regularization methods that improve topic coherence by using an external corpus and after removing stop words. mimno2011optimizing generalized the Pólya urn model aiming to reduce the number of low-quality topics, although this method did not reduce the number of bad topics. taddy2012estimation; chuang2012termite; sievert2014ldavis proposed distributional transformations to aid the selection of more interpretable terms.

Topics have also been studied for their similarity. li2006pachinko; wang2009mining; newman2009distributed used KL-divergence as the similarity measure to match similar topics. ramage2009labeled; chuang2015topiccheck aligned topics using cosine similarity. xing2018diagnosing used cosine distance to quantify the variability of topic distributions. blair2016increasing combined cosine similarity and Jensen-Shannon divergence to merge topics from different samples. aletras2014measuring showed that, among the distributional similarity measures, cosine distance outperforms other metrics including the log odds ratio (chaney2012visualizing). However, there is no universal threshold that determines when a pair of topics are (dis)similar.

Hierarchical clustering has been used in the literature to interactively align topics (chuang2015topiccheck) and to aggregate topic models (blair2016increasing). The former work assumes that topics align with up to one topic from a different realization. The latter work merges topics from realizations with small and large numbers of topics aiming to improve topic coherence. However, these works do not assess other aspects of topic quality, such as topic distinctiveness and topic stability, nor do they consider the likelihood of the resulting models.

3 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) (blei2003latent) is a probabilistic topic model that represents documents as mixtures over a finite number of topics and topics are distributions over words from a fixed vocabulary. In the context of grocery retail data, transactions are analogous to documents and topics are distributions over a fixed assortment of products.

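LDA follows a generative process in which transactions are created by sampling products from topics and topics are sampled from transaction-specific mixtures. More formally, the LDA generative process samples topics $\phi_k$ from a Dirichlet distribution governed by hyper-parameter $\beta$ and topical mixtures $\theta_d$ from a Dirichlet distribution governed by hyper-parameter $\alpha$. For each transaction, products are sampled with a two-step process. First, a topic assignment $z_{d,n}$ is chosen from the transaction-specific topical mixture $\theta_d$. Second, a product $w_{d,n}$ is sampled from the assigned topic $\phi_{z_{d,n}}$. Mathematically,

$$\phi_k \sim \text{Dirichlet}(\beta), \qquad \theta_d \sim \text{Dirichlet}(\alpha),$$

$$z_{d,n} \mid \theta_d \sim \text{Multinomial}(\theta_d), \qquad w_{d,n} \mid z_{d,n}, \phi \sim \text{Multinomial}(\phi_{z_{d,n}}).$$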
As depicted in figure 1, LDA has four types of variables: the corpus-level parameters $\alpha$ and $\beta$, the topic-level variables $\phi_k$, the document-level variables $\theta_d$, and the word-level variables $z_{d,n}$ and $w_{d,n}$. The words $w_{d,n}$ are the only observable variables.
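The generative process described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the function names, the fixed basket size, the symmetric priors and the hyper-parameter values are all hypothetical choices made for the example and are not taken from the paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalised Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_transactions(n_transactions, basket_size, n_topics, n_products,
                          alpha=0.5, beta=0.1, seed=0):
    """Simulate baskets from the LDA generative process (illustrative only)."""
    rng = random.Random(seed)
    # Topic level: each topic is a distribution over the product assortment.
    phi = [sample_dirichlet([beta] * n_products, rng) for _ in range(n_topics)]
    transactions = []
    for _ in range(n_transactions):
        # Transaction level: a topical mixture for this basket.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        basket = []
        for _ in range(basket_size):
            # Step 1: choose a topic assignment from the topical mixture.
            z = rng.choices(range(n_topics), weights=theta)[0]
            # Step 2: sample a product from the assigned topic.
            w = rng.choices(range(n_products), weights=phi[z])[0]
            basket.append(w)
        transactions.append(basket)
    return phi, transactions


phi, transactions = generate_transactions(5, 8, n_topics=3, n_products=20)
```

Each basket is a list of product ids with no meaningful order, which matches the bag-of-products assumption.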

In the context of text data, documents are assumed to be bags of words which implies that the word order within documents is insignificant and disregarded. In the application to grocery retail data, especially in the case of in-store transactions, products being registered in random order is a natural assumption. Thus, we assume that transactions are bags of products. Baskets are assumed independent and exchangeable, so basket metadata such as timestamp and coordinate location are disregarded. Topics are also independent of each other. In this work, we assume that the number of topics is fixed, but the proposed methods can also be applied to models with a variable number of topics such as the Hierarchical Dirichlet Process (teh2005sharing).




Figure 1: LDA graphical representation. Shaded nodes represent observable variables, i.e., products. Unshaded nodes represent the hidden variables: topic assignments, topical mixtures, product distributions, Dirichlet parameter over transactions, Dirichlet parameter over topics. $K$ is the number of topics, $D$ is the number of transactions and $N$ is the number of products.

3.1 Inference

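The topic distributions $\phi$ and topical mixtures $\theta$ can be learnt from the topic structure that maximizes the posterior conditional probability:

$$P(\mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)}{P(\mathbf{w} \mid \alpha, \beta)},$$

where $\mathbf{z}$ and $\mathbf{w}$ are vectors of topic assignments and observable words, respectively. The conditional joint distribution factorises as:

$$P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = P(\mathbf{w} \mid \mathbf{z}, \beta)\, P(\mathbf{z} \mid \alpha).$$

Parameters $\phi$ and $\theta$ can be easily integrated out due to conjugacy between the Dirichlet and Multinomial distributions. Thus,

$$P(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \frac{\Gamma(V\beta)}{\Gamma(\beta)^{V}} \, \frac{\prod_{v=1}^{V} \Gamma(N_{k,v} + \beta)}{\Gamma(N_{k} + V\beta)},$$

$$P(\mathbf{z} \mid \alpha) = \prod_{d=1}^{D} \frac{\Gamma\left(\sum_{k} \alpha_k\right)}{\prod_{k} \Gamma(\alpha_k)} \, \frac{\prod_{k=1}^{K} \Gamma(N_{d,k} + \alpha_k)}{\Gamma\left(N_{d} + \sum_{k} \alpha_k\right)},$$

where $N_{k,v}$ is the number of times that term $v$ has been assigned to topic $k$ ($k = 1, \dots, K$), $N_{d,k}$ is the number of terms in document $d$ ($d = 1, \dots, D$) that have been assigned to topic $k$, $N_{k}$ is the total number of terms assigned to topic $k$, $N_{d}$ is the number of terms in document $d$, and $V$ is the number of distinct terms in the vocabulary.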

Although the joint distribution can be computed for any setting of the hidden variables, the marginal probability $P(\mathbf{w} \mid \alpha, \beta)$ requires summing over every possible topic assignment and cannot be calculated directly. Exact computation of the posterior distribution is therefore intractable.

There are multiple approaches to approximate the posterior distribution, such as variational Bayes (blei2003latent), expectation propagation (minka2002expectation) and Gibbs sampling (griffiths2004finding). In this paper, we use Gibbs sampling to learn topic distributions in the application to grocery retail data since the method has shown advantages in computational implementation, memory, and speed.

3.1.1 Gibbs sampling

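Gibbs sampling iteratively draws the assignments of words to topics from their full conditional posterior distribution, defined as

$$P(z_{d,n} = k \mid \mathbf{z}^{-(d,n)}, \mathbf{w}, \alpha, \beta) \propto \frac{N^{-(d,n)}_{k,v} + \beta}{N^{-(d,n)}_{k} + V\beta} \left(N^{-(d,n)}_{d,k} + \alpha_k\right), \qquad v = w_{d,n},$$

where the notation $N^{-(d,n)}$ is a count that does not include the current assignment of $z_{d,n}$.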

The full conditional distribution can be thought of as the product of the probability of the word $w_{d,n}$ under topic $k$ and the probability of topic $k$ under the current topic distribution for document $d$. Consequently, the probability of assigning any particular word token to topic $k$ will be increased once many tokens of the same word have been assigned to topic $k$. Similarly, the probability of assigning any particular topic $k$ to document $d$ will be increased when topic $k$ has been used several times in document $d$.
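The sampling step described above can be sketched as a small collapsed Gibbs sampler. This is a minimal illustration and not the implementation used in the paper: it assumes a symmetric prior `beta` over terms and a (possibly asymmetric) list `alpha` over topics, it keeps the hyper-parameters fixed, and all function and variable names are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of integer term ids.
    alpha: list of n_topics positive floats; beta: positive float.
    Returns point estimates (phi, theta) and the final assignments z.
    """
    rng = random.Random(seed)
    n_kv = [[0] * vocab_size for _ in range(n_topics)]  # term v in topic k
    n_dk = [[0] * n_topics for _ in docs]               # topic k in document d
    n_k = [0] * n_topics                                # tokens in topic k

    def update(d, v, k, step):
        n_kv[k][v] += step
        n_dk[d][k] += step
        n_k[k] += step

    # Random initialisation of the topic assignments.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for v, k in zip(doc, z[d]):
            update(d, v, k, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                update(d, v, z[d][n], -1)  # exclude the current assignment
                # Unnormalised full conditional for every candidate topic.
                weights = [
                    (n_kv[k][v] + beta) / (n_k[k] + vocab_size * beta)
                    * (n_dk[d][k] + alpha[k])
                    for k in range(n_topics)
                ]
                z[d][n] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, v, z[d][n], +1)

    # Point estimates of topics and topical mixtures from the final counts.
    phi = [[(n_kv[k][v] + beta) / (n_k[k] + vocab_size * beta)
            for v in range(vocab_size)] for k in range(n_topics)]
    alpha_sum = sum(alpha)
    theta = [[(n_dk[d][k] + alpha[k]) / (len(doc) + alpha_sum)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    return phi, theta, z


# Toy baskets over a hypothetical assortment of 4 product ids.
docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 1, 1], [2, 3]]
phi, theta, z = gibbs_lda(docs, n_topics=2, vocab_size=4,
                          alpha=[0.5, 0.5], beta=0.1)
```

Running the sampler several times with different seeds yields the multiple posterior realizations that the later sections compare and cluster.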

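For any single posterior sample $\mathbf{z}$, we can infer $\phi$ and $\theta$ from the value of $\mathbf{z}$ by:

$$\hat{\phi}_{k,v} = \frac{N_{k,v} + \beta}{N_{k} + V\beta}, \qquad \hat{\theta}_{d,k} = \frac{N_{d,k} + \alpha_k}{N_{d} + \sum_{k'} \alpha_{k'}}.$$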
3.2 Dirichlet priors

The standard prior distribution in LDA is a symmetric Dirichlet prior governed by a concentration parameter and a uniform base measure, so that topics are equally likely a priori. However, wallach2009rethinking showed that an optimized asymmetric Dirichlet prior over document-specific distributions, under which topics are not equally likely a priori, improves model generalization and topic interpretability by capturing highly frequent terms in a few topics.

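The asymmetric Dirichlet prior is estimated using the Digamma Recurrence Relation (DRR) optimization method proposed by wallach2008structured. DRR is equivalent to the fixed-point iteration method proposed by minka2000estimating, but DRR is more efficient to compute because it records topic frequencies and uses the digamma recurrence relation. Then, asymmetric Dirichlet priors are optimized by:

$$\alpha_k^{*} = \alpha_k \, \frac{\sum_{n=1}^{\max_d N_d} C_k(n) \left[\Psi(n + \alpha_k) - \Psi(\alpha_k)\right]}{\sum_{n=1}^{\max_d N_d} C_{\cdot}(n) \left[\Psi\left(n + \sum_{k'} \alpha_{k'}\right) - \Psi\left(\sum_{k'} \alpha_{k'}\right)\right]},$$

where $\alpha_k^{*}$ are the optimal hyper-parameters, $C_k(n)$ is the number of documents in which topic $k$ has been seen exactly $n$ times, $C_{\cdot}(n)$ is the number of documents of size $n$, $\max_d N_d$ is the maximum document size and $\Psi$ is the digamma function.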

4 Topic model evaluation

Topic models are typically evaluated using model fit metrics based on likelihood estimation. In more exploratory applications, topics are assessed by their interpretability or coherence. However, other quality aspects such as topic distinctiveness and topic stability have not previously been assessed. We extend the definition of topic quality to include topic coherence, distinctiveness and stability. We evaluate topic models based on their likelihood and quality metrics.

4.1 Model generalization

Model fit metrics such as perplexity or held-out-likelihood of unseen documents estimate a realization’s capability for generalization or predictive power. Perplexity is a measurement of how well the probability model predicts a sample of unseen (or seen) data. A lower perplexity indicates the topic model realization is better at predicting the sample. Mathematically,

$$\text{Perplexity} = \exp\left(-\frac{\sum_{d} \log P(\mathbf{w}_d \mid \Phi, \alpha)}{\sum_{d} N_d}\right),$$

where $\mathbf{w}_d$ is a set of unseen words in a document, $N_d$ is the number of words in $\mathbf{w}_d$, $\Phi$ is the set of inferred topics, and $\alpha$ is a Dirichlet prior.
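As a small worked illustration of the perplexity definition, the sketch below turns per-document held-out log-likelihoods into a perplexity score. It assumes the log-likelihoods have already been estimated by some other routine; the function name and the toy numbers are hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity from per-document log-likelihoods (natural log) and lengths."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_tokens = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_tokens)


# Sanity check: a model that spreads probability uniformly over 4 products
# assigns log(1/4) to every token, so its perplexity should equal 4.
lengths = [3, 5]
log_likelihoods = [n * math.log(1 / 4) for n in lengths]
uniform_perplexity = perplexity(log_likelihoods, lengths)
```

The uniform case gives a useful reading of the score: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 products.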

Computing the log-likelihood of a topic model on unseen data is an intractable task. Several estimation methods are described in wallach2009evaluation; buntine2009estimating. In this paper, we use perplexity to evaluate the performance of inferred topics and the left-to-right algorithm to estimate the log-likelihood on held-out documents.

4.2 Topic coherence

A topic is said to be coherent when its most likely terms can be interpreted and associated with a single semantic concept. For instance, ‘a bag of egg noodles’, ‘a package of prepared stir fry’ and ‘a sachet of Chinese stir fry sauce’ are items that can be easily associated with the topic of ‘Asian stir fry’. On the other hand, a non-coherent topic highlights products that do not seem to fulfill a particular customer need. For example, ‘a bag of egg noodles’, ‘a bunch of bananas’ and ‘a lemon cake’ are items that together do not convey a clear purpose.

User studies have shown that metrics of word co-occurrence tend to correlate with human judgment on topic coherence. Thus, topics tend to be coherent when their characteristic words co-appear across the corpus. Typically, co-occurrence is measured by Pointwise Mutual Information (PMI) and Normalized Pointwise Mutual Information (NPMI) (bouma2009normalized). PMI and NPMI measure the co-occurrence of a pair of words. PMI measures the probability of seeing two words in close proximity in comparison to the probability of seeing them individually. NPMI standardizes PMI, providing a score in the range of $[-1, 1]$. NPMI towards $1$ corresponds to high co-occurrence. In this paper, we focus on NPMI since it has been shown to have a higher correlation with the human evaluation of topic coherence than PMI (lau2014machine). More formally,

$$\text{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}, \qquad \text{NPMI}(w_i, w_j) = \frac{\text{PMI}(w_i, w_j)}{-\log p(w_i, w_j)}.$$
PMI and NPMI only compute the co-occurrence of the most representative words. We set these representative words to the 15 most probable terms, following blei2003latent; griffiths2004finding; steyvers2007probabilistic; chang2009reading; newman2010automatic; chaney2012visualizing. The topic coherence score is given by the average NPMI of the most representative word combinations and model coherence is defined by the average of the topic coherence scores.
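The coherence score can be sketched as follows. This is an illustration under stated assumptions: probabilities are estimated as the fraction of transactions containing a product (or a pair of products), pairs that never co-occur are scored $-1$ by convention, and the product names and baskets are made up for the example.

```python
import math
from itertools import combinations


def npmi(p_i, p_j, p_ij):
    """Normalized PMI of a pair given marginal and joint probabilities."""
    if p_ij == 0.0:
        return -1.0  # convention: the pair never co-occurs
    if p_ij == 1.0:
        return 0.0   # degenerate case (assumption): both terms always appear
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / -math.log(p_ij)


def topic_coherence(top_terms, transactions):
    """Average NPMI over all pairs of a topic's most probable terms."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)

    def prob(*terms):
        return sum(all(t in b for t in terms) for b in baskets) / n

    scores = [npmi(prob(a), prob(b), prob(a, b))
              for a, b in combinations(top_terms, 2)]
    return sum(scores) / len(scores)


# Hypothetical toy corpus of four baskets.
transactions = [
    ["noodles", "stir_fry", "sauce"],
    ["noodles", "stir_fry"],
    ["bananas", "cake"],
    ["noodles", "sauce", "stir_fry"],
]
coherent = topic_coherence(["noodles", "stir_fry", "sauce"], transactions)
incoherent = topic_coherence(["noodles", "bananas", "cake"], transactions)
```

In this toy corpus the stir-fry products co-occur often, so their topic scores higher than the mixed one; model coherence would then be the mean of such scores over all topics.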

4.3 Topic distinctiveness

Topic distinctiveness refers to the dissimilarity of one topic in comparison to the topics of the same realization. In other words, a topic is distinctive if no other topic highlights similar products nor exhibits a repetitive theme. For instance, ‘a bottle of sparkling water hint apple’, ‘a bottle of sparkling water hint grape’ and ‘a bottle of sparkling water hint orange’ are items that are interpreted as the topic of ‘flavored sparkling water’. This topic and the ‘Asian stir fry’ topic are distinctive from each other. On the other hand, a topic that is characterized by: ‘a bottle of sparkling water hint lemon’, ‘a bottle of sparkling water hint mango’ and ‘a bottle of sparkling water hint lime’ can be interpreted as non-distinctive from the ‘flavored sparkling water’ topic since they both exhibit the same theme.

Here we use cosine distance, a distributional similarity metric, which has been shown to have a high correlation with human judgment on topic similarity, outperforming other distributional methods such as KL-divergence and the Log Odds Ratio (aletras2014measuring). The cosine distance for two topic vectors $\phi_i$ and $\phi_j$ is defined as

$$\text{CD}(\phi_i, \phi_j) = 1 - \frac{\phi_i \cdot \phi_j}{\lVert \phi_i \rVert \, \lVert \phi_j \rVert}.$$
We use the minimum cosine distance within a sample as a measure of the topic distinctiveness of an entire LDA posterior sample. Within a realization, topics tend to be distinct from each other, hence the minimum distance also tends to be high. But if there is a degree of similarity between topics, the minimum distance drops, and the closer the topics are, the smaller the minimum distance is. We denote the minimum cosine distance of a topic $\phi_i$ within realization $s$ as

$$\text{CD}^{s}_{\min}(\phi_i) = \min_{\phi_j \in \Phi_s,\; j \neq i} \text{CD}(\phi_i, \phi_j),$$

where $\phi_i$ and $\phi_j$ are topics in $\Phi_s$, the set of topics of realization $s$.

4.4 Topic stability

Within a set of LDA posterior samples, topics may appear and disappear as a result of posterior uncertainty. For example, among 20 LDA draws, the ‘meal deal’ topic may appear 20 times; the ‘Asian stir fry’ topic may appear 18 times, and the ‘chocolate bars’ topic may appear 10 times. The uncertainty around topics cannot be captured by a single LDA draw and negatively affects practitioners’ confidence in the method. We use the minimum cosine cross-distance of a topic across samples as a measure of topic stability, denoted by

$$\text{CD}^{s,t}_{\min}(\phi_i) = \min_{\phi_j \in \Phi_t} \text{CD}(\phi_i, \phi_j), \qquad \phi_i \in \Phi_s,$$

where $\Phi_s$ and $\Phi_t$ are sets of topics in two different posterior samples. Thus, the minimum distance between a given topic of realization $s$ and the topics in realization $t$ will be close to 0 if the topic reappears in realization $t$.
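The two distance-based measures, distinctiveness within a realization and stability across realizations, can be sketched together. The helper names and the toy topic distributions are hypothetical and only meant to illustrate the definitions.

```python
import math


def cosine_distance(p, q):
    """One minus the cosine similarity of two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def min_distance_within(topics):
    """Distinctiveness: each topic's distance to its nearest sibling topic."""
    return [min(cosine_distance(t, u) for j, u in enumerate(topics) if j != i)
            for i, t in enumerate(topics)]


def min_distance_across(topics_s, topics_t):
    """Stability: each topic of sample s against its nearest topic in sample t."""
    return [min(cosine_distance(t, u) for u in topics_t) for t in topics_s]


# Two hypothetical realizations over an assortment of three products.
sample_s = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
sample_t = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

within = min_distance_within(sample_s)
across = min_distance_across(sample_s, sample_t)
```

Here the first two topics of `sample_s` are near-duplicates, so their within-sample distances are small (low distinctiveness), while the third topic has no counterpart in `sample_t`, so its cross-distance is large (low stability).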

5 Clustering and selection of recurrent topics

We introduce a methodology that aims to summarise the posterior distribution of a topic model by quantifying the recurrence of topic modes across samples. Recurrent topics tend to appear several times or most of the time among LDA realizations, showing more stability. In order to group the topics across samples that represent the same theme, we use a hierarchical clustering method that retrieves clusters of topical similarity. The resulting clusters are used to quantify topic recurrence, which is ultimately used to identify and filter out topics of high uncertainty. We distinguish our work from chuang2015topiccheck; blair2016increasing by showing that selecting recurrent topics achieves competitive levels of perplexity and topic distinctiveness while outperforming LDA in terms of topic coherence and topic stability.

5.1 Clustering of topics

Agglomerative hierarchical clustering (AHC) is a widely used statistical method that groups units according to their similarity, following a bottom-up merging strategy. The algorithm starts with as many clusters as the number of units, and at each step, the AHC merges the pair of clusters with the smallest distance. AHC finishes when all the units are aggregated in a single cluster or when the distance among clusters is larger than a fixed threshold. In comparison to other clustering techniques, AHC does not require fixing the number of clusters a priori.

We use the AHC algorithm to aggregate and fuse topics from multiple realizations. To assess cluster similarity, we use cosine distance (CD) and the average linkage method. We opt for CD since it has outperformed correlation on human evaluation of topic similarity (aletras2014measuring) and human rating of posterior variability (xing2018diagnosing). We opt for the average linkage method since it has empirically worked better than single and complete linkage methods, i.e., single linkage tended to create an extremely large cluster of low coherence, and complete linkage tended to create clusters of low distinctiveness. However, we slightly modify the algorithm to only merge topics whose cosine distance is lower than a user-specified threshold. In this manner, we stop merging topics once the distance between all pairs of clusters exceeds the threshold.

Our version of AHC has two inputs: a bag of topics with realization indices and a CD threshold. It returns a set of clusters $C$. The bag of topics contains topics from several realizations and each topic is associated with a posterior sample index, e.g., 10 LDA samples of 50 topics create a bag of 500 topics, each associated with one of 10 realization indices. The CD threshold is fixed by the user. Clusters are groups of topics that belong to different realizations and whose average CD is lower than the CD threshold. Each cluster $c \in C$ is represented by a clustered topic $\bar{\phi}_c$ with a cluster size $|c|$. The clustered topic is the average distribution of the topics that share the same membership. The cluster size is the number of members, e.g., clustering 100 identical realizations of 50 topics would retrieve 50 clusters of 100 members each.
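The threshold-stopped clustering described above can be sketched with a naive average-linkage implementation. This is not the authors' code: it is a plain-Python illustration that recomputes linkages at every step (adequate only for small bags), and the bag of topics and the threshold of 0.1 are made-up values for the example.

```python
import math
from itertools import combinations


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def cluster_topics(bag, threshold):
    """Average-linkage AHC that stops once no pair is closer than threshold.

    bag: list of (realization_index, topic_distribution) pairs.
    Returns clusters as lists of positions in the bag.
    """
    topics = [topic for _, topic in bag]
    dist = [[cosine_distance(p, q) for q in topics] for p in topics]
    clusters = [[i] for i in range(len(bag))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        if linkage(clusters[x], clusters[y]) >= threshold:
            break  # every remaining pair is at least `threshold` apart
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


def clustered_topic(bag, members):
    """Average distribution of the topics that share a cluster."""
    vocab_size = len(bag[0][1])
    return [sum(bag[i][1][v] for i in members) / len(members)
            for v in range(vocab_size)]


# Hypothetical bag: three realizations of two topics over three products.
bag = [
    (0, [0.90, 0.10, 0.00]), (0, [0.00, 0.10, 0.90]),
    (1, [0.88, 0.12, 0.00]), (1, [0.00, 0.12, 0.88]),
    (2, [0.91, 0.09, 0.00]), (2, [0.10, 0.80, 0.10]),
]
clusters = cluster_topics(bag, threshold=0.1)
summary = [(len(c), clustered_topic(bag, c)) for c in clusters]
```

In this toy bag, one theme recurs in all three realizations, one in two, and one appears only once, so the cluster sizes directly expose topic recurrence.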

5.2 Selection of recurrent topics

Topic recurrence or stability refers to the capability of a topic to consistently reappear across multiple realizations. Given a clustered topic $\bar{\phi}_c$ obtained from cluster $c$, the recurrence of its associated theme is naturally measured by its cluster size $|c|$. In other words,

$$\text{recurrence}(\bar{\phi}_c) = |c|.$$
As we will later show empirically, clustered topics of low recurrence do not improve perplexity, nor increase distinctiveness or coherence. In contrast, clustered topics associated with large cluster sizes tend to be more coherent and distinctive. Thus, only a subset of the clustered topics is useful for representing the corpus.

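To identify a subset of clustered topics that well represent the data, we evaluate subsets of different recurrence levels:

$$\Phi_r = \left\{ \bar{\phi}_c : |c| \geq r \right\}, \qquad r = 1, \dots, S,$$

where $\bar{\phi}_c$ is a clustered topic, $|c|$ is its cluster size and $S$ is the number of realizations that composed the bag of topics.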

We compute perplexity for each subset and select the subset that achieves a desirable perplexity and has the largest cluster size. In other words, we penalize average perplexity by topic stability, favoring topic modes of low uncertainty. As we will show in the next section, cluster size as a measure of topic recurrence leads to subsets of good topic quality.
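The selection step can be sketched as below. Two points are assumptions made only for this example: subsets are formed by a minimum cluster size, and a "desirable" perplexity is read as one within a relative `tolerance` of the best subset. The perplexity values are invented placeholders, not results.

```python
def subsets_by_recurrence(clusters, n_realizations):
    """For each minimum cluster size r, keep the clusters with at least r members."""
    return {r: [c for c in clusters if len(c) >= r]
            for r in range(1, n_realizations + 1)}


def select_recurrence_level(perplexity_by_level, tolerance):
    """Largest recurrence level whose perplexity is within tolerance of the best."""
    best = min(perplexity_by_level.values())
    admissible = [r for r, p in perplexity_by_level.items()
                  if p <= best * (1.0 + tolerance)]
    return max(admissible)


# Hypothetical clusters (lists of topic positions) from 3 realizations.
clusters = [[0, 2, 4], [1, 3], [5]]
subsets = subsets_by_recurrence(clusters, n_realizations=3)

# Placeholder perplexities, one per recurrence level.
perplexities = {1: 104.0, 2: 100.0, 3: 103.0}
loose_choice = select_recurrence_level(perplexities, tolerance=0.05)
strict_choice = select_recurrence_level(perplexities, tolerance=0.02)
```

A looser tolerance favours higher recurrence (more stable topics) at a small cost in perplexity, which mirrors the trade-off described in the text.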

6 Application to grocery retail data

We apply topic models in the domain of the grocery retail industry, where grocery transactions are seen as documents and individual products are seen as words. In this application, topics are distributions over a fixed assortment of products and transactions exhibit multiple topics.

We analyzed 1 million grocery transactions from a major retailer in the UK. The corpus contained roughly 10 million items and an assortment of 17,000 products. The held-out-data gathered 1,000 transactions. Transactions were pseudo-anonymized and randomly sampled, covering nationwide stores between September 2017 and September 2018. No personal customer data were used for this research.

6.1 Human judgement on topic coherence and topic similarity

To perform our post-processing clustering, a meaningful cosine distance threshold needs to be set. Similarly, absolute measures of coherence can only be meaningful within the application context. To this end, we carried out a user study to collect human judgment on topic similarity and topic coherence and set empirical thresholds driven by users’ interpretations. Experts from a leading data science company specializing in the grocery market participated in the user study.

As mentioned before, topic coherence evaluates whether individual topics can be easily linked to semantic themes. Topic distinctiveness evaluates the similarity between two topics. Users were asked to evaluate topics using a discrete scale from 1 to 5. For topic similarity, a score of 1 refers to highly different topics, and a score of 5 refers to highly similar topics. For topic coherence, a score of 1 refers to highly incoherent topics, and a score of 5 refers to highly coherent topics. Topics were represented by the top 10 most probable words and were sampled from LDA realizations with between 25 and 150 topics and with varying values for the hyper-parameters $\alpha$ and $\beta$. The range in the number of topics corresponds to an initial belief of having no less than 25 topics and no more than 150 topics. In total, 153 evaluations for topic distinctiveness and 112 evaluations for topic coherence were collected.

Figure 2: Human evaluation of topic similarity and topic coherence of retail topics. Panel (a), Topic Similarity, shows similarity scores against the cosine distance between compared topics. Panel (b), Topic Coherence, shows coherence scores against topic NPMI. Blue error bars show means and confidence intervals for the means. Interpreting results, low cosine distances indicate high similarity while high cosine distances indicate high dissimilarity. It is also observed that low NPMI corresponds to incoherent topics and high NPMI corresponds to highly coherent topics.

Figure 3: Summary of LDA realizations of 50, 100, 150 and 200 topics on grocery retail data. Panels: (a) Generalization, (b) Topic coherence, (c) Topic distinctiveness, (d) Topic stability. Error bars show means and confidence intervals. In our application, LDA realizations of 100 and 150 topics achieve better generalization than LDA realizations with 50 or 200 topics. Realizations with more topics tend to have more coherent topics but also more incoherent topics. Within realizations, topics may show some degree of similarity. We observe that topics may not reappear with high similarity across realizations.

Figure 2(a) compares human judgment on topic similarity against cosine distance. Unsurprisingly, the lower the cosine distance, the more similar the topic distributions are. We observe that is associated with topics of high similarity, while is associated with topics of high dissimilarity. Topics at might show some degree of similarity.

Figure 2(b) compares human judgment on topic coherence against NPMI. Despite the subtle positive correlation, there is no clear NPMI boundary that can precisely identify coherent topics. However, we perceive that topics with are interpreted as highly incoherent and topics with as highly coherent. We use these interpretations to guide the results in the next sections.
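To make the coherence score concrete, the following minimal Python sketch averages pairwise NPMI over the top words of a topic, estimating probabilities from basket co-occurrence. The function name `topic_npmi`, the toy baskets, and the conventions for pairs that never co-occur (scored as -1) or that appear in every basket (scored as 0) are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import math
from itertools import combinations


def topic_npmi(top_words, baskets):
    """Average NPMI over all pairs of a topic's top words.

    Probabilities are estimated as the fraction of baskets that contain
    a product (or both products of a pair).
    """
    basket_sets = [set(b) for b in baskets]
    n = len(basket_sets)

    def prob(*words):
        return sum(all(w in b for w in words) for b in basket_sets) / n

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(wi), prob(wj), prob(wi, wj)
        if p_ij == 0:
            scores.append(-1.0)  # assumed convention: the pair never co-occurs
        elif p_ij == 1:
            scores.append(0.0)   # assumed convention: NPMI is undefined (0/0) here
        else:
            pmi = math.log(p_ij / (p_i * p_j))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)


# Toy baskets (hypothetical data, for illustration only).
baskets = [["soup", "bread"], ["soup", "bread"], ["milk"], ["milk", "eggs"]]
print(topic_npmi(["soup", "bread"], baskets))  # products that always co-occur
print(topic_npmi(["soup", "milk"], baskets))   # products that never co-occur
```

NPMI ranges from -1 (products never bought together) to 1 (products always bought together), with 0 indicating independence, which is why absolute thresholds still need the domain-specific calibration described above.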

6.2 LDA performance

We assess LDA models on the application of grocery retail data on four quality aspects: generalization (perplexity), topic coherence (NPMI), topic distinctiveness (minimum cosine distance within a realization) and topic stability (minimum cosine distance across realizations). Following wallach2009rethinking, we train LDA with an asymmetric Dirichlet prior over document-specific distributions and optimization of Dirichlet parameters. We obtain 20 independent posterior realizations of LDA using 50, 100, 150 and 200 topics, using random initialization seeds, initial guesses of and and 1000 iterations. We show that single LDA realizations are not guaranteed to generate coherent, distinctive, or stable topics.

Figure 3(a) shows the perplexity of LDA models. Realizations of 100 topics tend to have the best generalization capability. Realizations of 150 topics achieve higher (worse) perplexity than realizations of 100 topics, but better perplexity than realizations of 50 and 200 topics.

Figure 3(b) shows that larger models tend to have highly coherent topics () but also highly incoherent topics (). Notice that realizations of 50 topics achieved the worst perplexities; however, they showed the highest average topic coherence due to the lack of low NPMI values. In agreement with chang2009reading, realizations with higher coherence do not necessarily have the best likelihood. Models with too many topics might not achieve better perplexity due to the presence of some incoherent topics.

In Figure 3(c), we compute the minimum cosine distance among topics of the same realization, , to measure the topic distinctiveness. If two topics within a realization exhibit similar distributions, and therefore similar themes, then the cosine distance tends to 0. In this application, we observe that distinctiveness increases with the number of topics. None of the realizations present highly similar topics (cosine distance below 0.1), but some degree of similarity is shown (cosine distance between 0.1 and 0.3).

In Figure 3(d), we measure topic stability by calculating the minimum cosine cross-distance between one topic and topics from a second realization, namely . If one topic reappears in a second realization, then the minimum cosine distance tends to 0. Conversely, if the topic is not inferred by a second realization, then the minimum cosine distance tends to 1. Among the evaluated LDA samples, we observe that several topics obtain high cosine distances (), indicating that uncertain topics may not reappear in second realizations.
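The two distance-based measures above can be sketched as follows. This is a minimal illustration, assuming each topic is a probability vector over the same product vocabulary and that the minimum is taken per topic (its nearest neighbour within the same realization for distinctiveness, or within a second realization for stability). The function names and toy vectors are hypothetical.

```python
import math


def cosine_distance(p, q):
    """1 minus the cosine similarity of two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def min_distance_within(topics):
    """Distinctiveness: distance from each topic to its closest
    other topic in the same realization."""
    return [
        min(cosine_distance(t, other) for j, other in enumerate(topics) if j != i)
        for i, t in enumerate(topics)
    ]


def min_distance_across(topics_a, topics_b):
    """Stability: distance from each topic in realization A to its
    closest topic in realization B."""
    return [min(cosine_distance(t, u) for u in topics_b) for t in topics_a]


# Toy realizations over a three-product vocabulary (hypothetical data).
run_1 = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
run_2 = [[0.1, 0.1, 0.8], [0.3, 0.6, 0.1]]
print(min_distance_within(run_1))
print(min_distance_across(run_1, run_2))
```

In this toy case the second topic of `run_1` reappears exactly in `run_2` (distance close to 0), while the first topic has no close counterpart, mirroring the reappearing and disappearing topics discussed in the text.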

(a) NPMI = 0.59
(b) NPMI = -0.1
Figure 4: Within an LDA realization of 100 topics, 4(a) and 4(b) show the topics with the highest and lowest NPMI; 4(c) and 4(d) correspond to less distinctive (more similar) topics. Brand names have been replaced by XXX for anonymization purposes.

To highlight the flaws of individual LDA realizations, we analyze the LDA realizations of 100 topics that obtained the lowest perplexities.

Realizations may not include only the most coherent topics, i.e., those with high values of NPMI. Coherent topics exhibit clear shopping motivations and may or may not contain products from the same category. Figures 4(a) and 4(b) show the topics with the highest and lowest values of NPMI within the same realization, respectively. While the former topic can be easily interpreted as ‘Branded soup’ and its top products belong to the same category, the latter topic gathers products from different categories without a clear shopping motivation.

Realizations may infer topics that are less distinctive and that describe the same or similar themes. We observe topics with a degree of similarity ( and ) in Figures 4(c) and 4(d). Both topics belong to the same realization, are described by the same products (with some exceptions), and may be associated with the same theme.

Topics may reappear and disappear among realizations. As depicted in Figure 8(a), two realizations of 100 topics do not include the same set of inferred topics. When a topic reappears in a second realization with high similarity, the cosine distance is expected to be less than 0.1; when it reappears with some degree of similarity, the cosine distance is expected to be larger than 0.1 but less than 0.3. In Figure 8(a), 20% of the inferred topics show a minimum cosine distance larger than 0.3, indicating that they were not found in the second realization even with some degree of similarity.

In summary, the LDA realizations with the best perplexity do not always have topics that are coherent, distinctive and stable. This highlights the need for methods that can collect high-quality topics while maintaining low perplexity.

7 Clustering and selection of recurrent topics

We apply our proposed methodology to summarize the LDA posterior realizations and focus on more stable topics. We will show that non-recurrent topics do not improve perplexity and tend to have low coherence. More importantly, subsets of clustered topics augment topic coherence and topic stability without sacrificing model generalization.

We explore LDA samples of 100 topics since they achieved the lowest perplexities in our application. We explore these models by clustering a bag of 2000 LDA topics coming from 20 LDA samples of 100 topics. We repeat this experiment 5 times. In each experiment, a bag of topics is created using 20 different LDA samples and no sample is shared across experiments. We run HC with the cosine distance thresholds of , assuming that already corresponds to dissimilar topics and might not be sufficient to form distinctive clusters. Subsets of clustered topics are formed and sorted by topic recurrence (minimum cluster stability) in decreasing order, i.e., clusters with 20 topics form the first subset, clusters with a minimum of 19 topics form the second subset, and so on, until clusters with a minimum of 1 topic (all clustered topics) form the last subset. For each subset, we compute perplexity, NPMI, and .
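The clustering and subset-selection procedure described above can be sketched in a few lines. This is a simplified illustration, not the implementation used in the experiments: it assumes average linkage (the linkage criterion is an assumption of this example), stops merging once the closest pair of clusters is farther apart than the cosine distance threshold, and then keeps only clusters whose size reaches a minimum. The names `cluster_topics` and `recurrent_subset` and the toy bag of topics are hypothetical.

```python
import math


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def cluster_topics(topics, threshold):
    """Agglomerative clustering of a bag of topics (average linkage assumed).

    Repeatedly merges the two closest clusters until no pair of clusters
    is closer than `threshold`. Returns clusters as lists of topic indices.
    """
    clusters = [[i] for i in range(len(topics))]

    def linkage(a, b):
        total = sum(cosine_distance(topics[i], topics[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        dist, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


def recurrent_subset(clusters, min_size):
    """Keep only clusters that gather at least `min_size` topics."""
    return [c for c in clusters if len(c) >= min_size]


# Toy bag: one theme appears three times, another twice, one only once.
bag = [
    [0.80, 0.10, 0.10], [0.10, 0.80, 0.10], [0.78, 0.12, 0.10],
    [0.12, 0.78, 0.10], [0.82, 0.09, 0.09], [0.10, 0.10, 0.80],
]
clusters = cluster_topics(bag, threshold=0.25)
print(clusters)
print(recurrent_subset(clusters, min_size=2))
```

Sweeping `min_size` from the number of LDA samples down to 1 yields the nested subsets evaluated in this section, with `min_size=1` returning every clustered topic, including the non-recurrent ones.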

(a) Generalization
(b) Coherence
(c) Distinctiveness
(d) Stability
Figure 5: Generalization, coherence, distinctiveness, and stability of clustered topics. Error bars indicate mean and one standard deviation. Blue lines show mean and one standard deviation of LDA samples.

Figure 5(a) shows the average perplexity and error bars of subsets of clustered topics formed at different minimum cluster sizes and cosine distance thresholds. For visualization purposes, perplexities larger than 8.5 are not shown. The average perplexity of LDA models with 100 topics is depicted by blue dashed lines. Depending on the cosine distance threshold, subsets of clustered topics can achieve significantly better perplexities than LDA realizations of 100 topics. More importantly, perplexity gets significantly worse when selecting non-recurrent topics (minimum cluster size of 1), i.e., topics that only appear once across the clustered LDA samples. For the 0.15 and 0.25 CD thresholds, perplexity rapidly decreases as the minimum cluster size is reduced, but it reaches a plateau when subsets include low-recurrent topics (minimum cluster size goes from 2 to 8 and from 3 to 9, respectively). This indicates that perplexity is not significantly improved by selecting more topics of low recurrence. Note that subsets created by 0.35 and 0.45 cosine distance thresholds obtained significantly larger perplexities. Large cosine distance thresholds may merge clusters or join topics that are associated with different themes, which in either case deteriorates the generalization capability of the subset.

Figure 5(b) displays the average NPMI and error bars of subsets of clustered topics at different minimum cluster sizes and cosine distance thresholds. The average NPMI of LDA models with 100 topics is depicted by blue dashed lines. The measure of topic coherence continuously decreases when lowering the minimum cluster size and plummets when subsets include non-recurrent topics (minimum cluster size of 1). This implies a relationship between coherence and recurrence: the most recurrent topics are the most coherent and, conversely, the non-recurrent topics are the least coherent. Subsets of clustered topics with a minimum cluster size of at least 4 show average values of NPMI that are significantly larger than the LDA’s NPMI average of 0.357 (see Table 1).

Figure 5(c) shows the average and error bars of the cosine distance within samples at different minimum cluster sizes and cosine distance thresholds. The average for LDA models with 100 topics is depicted by blue dashed lines. The distinctiveness measure gradually decreases when lowering the minimum cluster size. This indicates that clustered topics of low recurrence also show some degree of similarity with clustered topics of large recurrence. Note that subsets that include low-recurrent topics, formed with CD thresholds of 0.15, 0.25, and 0.35, obtain average values that are significantly lower than the average distinctiveness measure of LDA models (see Table 1). Thus, subsets of clustered topics may lead to topical selections that are less distinctive than LDA topics.

Figure 5(d) exhibits the average and error bars of the minimum cosine distance across samples. The average for LDA models with 100 topics is depicted by blue dashed lines. We observe that this distance decreases when reducing the minimum cluster size, but it starts increasing when the minimum cluster size is less than 7. The best averages and lowest dispersions are achieved when the minimum cluster size varies from 8 to 10 for a CD threshold of 0.25 and from 5 to 8 for a CD threshold of 0.15. The lowest values are roughly 0.04, significantly lower than the stability measure achieved by LDA samples (see Table 1).

Based on this analysis, we select the subset generated by a minimum cluster size of 9 and a CD threshold of 0.25. We compare the performance of this subset against the average performance of the LDA model with 100 topics in Table 1. Results show that the selected subset of clustered topics maintains similar levels of perplexity while significantly improving topic coherence and topic stability. Note that topic distinctiveness is not improved, which might be the outcome of excluding highly distinctive non-recurrent topics. Also, the selected subset achieves similar perplexity with fewer topics.

Model | Topics | Generalization (Perplexity) | Coherence (Ave. NPMI) | Distinctiveness (Ave. min. CD within samples) | Stability (Ave. min. CD across samples)
LDA-100 | 100 | 8.260 ± 0.004 | 0.357 ± 0.005 | 0.555 ± 0.008 | 0.163 ± 0.012
HC-LDA-100 | 85.4 ± 1.82 | 8.253 ± 0.002 | 0.374 ± 0.003 | 0.509 ± 0.017 | 0.040 ± 0.007
Table 1: Generalization, coherence, distinctiveness and stability metrics of LDA samples with 100 topics and subsets of clustered topics obtained from clustering bags of 2000 topics. Clustered topics show slightly lower perplexity, significantly larger average NPMI and a significantly lower average minimum cosine distance across samples, i.e., higher stability.

7.1 Coherence, Distinctiveness and Stability of Clustered Topics

In this section, we illustrate the coherence, distinctiveness, and stability of clustered topics obtained from clustering LDA realizations. We observe that clustered topics are more coherent, less distinctive but far more stable than inferred topics. Roughly, 90% of clustered topics as opposed to 40% of LDA topics reappear in a second sample.

Topic coherence, measured by NPMI, is displayed in Figure 6(a), which compares NPMI distributions of clustered topics (HC-LDA-100) and inferred LDA topics (LDA-100). Error bars show means and confidence intervals. Clustered topics are obtained by clustering 20 LDA realizations of 100 topics with a minimum cluster size of 9 and a cosine distance threshold of 0.25. Inferred topics come from LDA realizations of 100 topics. Comparing distributions, we observe that clustered topics tend to have large values of NPMI, while LDA samples may include topics of low NPMI. Thus, the average NPMI of subsets of clustered topics is significantly larger than the average NPMI of LDA realizations. Figures 6(b) and 6(c) provide examples of the topics with the highest and lowest NPMI values within a sample of clustered topics. These topics can be associated with the shopping motivations of ‘branded soup’ and ‘health care’, respectively. Note that the ‘branded soup’ topic is also the topic with the highest NPMI within an LDA sample, as shown in Figure 4(a). Also, observe that the ‘health care’ topic is much easier to interpret than the topic with the lowest NPMI within an LDA sample, shown in Figure 4(b).

Topic distinctiveness of clustered topics (HC-LDA-100) and inferred LDA topics (LDA-100), measured by the minimum cosine distance within a sample, is displayed in Figure 7(a). Error bars show means and confidence intervals. As mentioned before, clustered topics are obtained by clustering 20 LDA realizations of 100 topics with a minimum cluster size of 9 and a cosine distance threshold of 0.25, and inferred topics are obtained from LDA realizations of 100 topics. We observe that LDA topics within samples are more distinctive than the subsets of clustered topics. Non-recurrent topics are distinctive among samples and therefore do not cluster with other topics. Since non-recurrent topics are excluded from the subset of clustered topics, the overall distribution has less distinctive values. Analyzing the least distinctive topics within a subset of clustered topics, shown in Figures 7(b) and 7(c), we notice that they are also similar to the least distinctive inferred LDA topics displayed in Figures 4(c) and 4(d).

(a) NPMI distribution
(b) NPMI = 0.59
(c) NPMI = 0.20
Figure 6: 6(a) shows the NPMI distribution of clustered topics and LDA inferred topics. 6(b) and 6(c) display the clustered topics with the highest and lowest NPMI, respectively. Brand names have been replaced by XXX for anonymization purposes.
(a) distribution
Figure 7: 7(a) shows the distribution of the minimum cosine distance within samples for clustered topics and LDA inferred topics. 7(b) and 7(c) show the less distinctive (more similar) topics within a subset of clustered topics. Brand names have been replaced by XXX for anonymization purposes.
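Topic stability by minimum cosine distance (CD) to the closest topic in the second set
Comparison | Topics | High similarity (CD below 0.1) | Some similarity (CD between 0.1 and 0.3) | No match (CD above 0.3)
HC-LDA-100-I to HC-LDA-100-II | 85 | 77 (90.6%) | 6 (7%) | 2 (2.4%)
HC-LDA-100-II to HC-LDA-100-I | 85 | 77 (90.6%) | 4 (4.7%) | 4 (4.7%)
LDA-100-I to LDA-100-II | 100 | 40 (40%) | 44 (44%) | 16 (16%)
LDA-100-II to LDA-100-I | 100 | 37 (37%) | 40 (40%) | 23 (23%)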
Table 2: Stability of clustered topics and LDA topics. Roughly, 90% of clustered topics reappear in a second subset and 40% of LDA topics reappear in a second realization.
(a) Average
(b) Average
Figure 8: UK grocery retail topics. Topics show a variety of shopping motivations, i.e., diet orientation, cook from scratch, a specific event, a specific-food consumption, promotions, etc. Topics display customer preference driven by economics, family composition, geography, and seasonality.

Topic stability, measured by the average minimum cosine distance across samples, is computed by calculating the cosine distance between two sets of clustered topics and between two samples of LDA with 100 topics. If a topic is recurrent in the second sample, then it is expected to show a small cosine distance with its pairing topic and large cosine distances with the other topics within the second sample. If so, the diagonal of the cosine distance matrix should show small distances and the off-diagonal entries should show large cosine distances. If the topic does not appear in the second sample, then it would show large cosine distances to every topic. Figures 8(a) and 8(b) show the cosine distance between two samples of LDA (LDA-100-I and LDA-100-II) and between two sets of clustered topics (HC-LDA-100-I and HC-LDA-100-II). Clustered topics (HC-LDA-100-I and HC-LDA-100-II) are obtained by clustering distinct bags of topics. Each bag aggregates 20 LDA samples and no sample is shared between bags. We observe that roughly 90% of clustered topics reappear in a second subset with high similarity (cosine distance below 0.1). In contrast, only 40% of the LDA topics reappear with high similarity (cosine distance below 0.1); another 40% reappear with some degree of similarity (cosine distance between 0.1 and 0.3); and roughly 20% do not match any topic from the other realization (cosine distance above 0.3). Also, notice the existence of some topics with high similarity (cosine distance below 0.1) outside the diagonal. These results are detailed in Table 2. The proposed methodology retrieves more stable clustered topics than individual LDA samples.
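The counts reported in Table 2 follow from binning each topic's minimum cosine distance to the second set of topics. A minimal sketch of that tabulation is given below. The function name `stability_table` and the toy distances are hypothetical, and treating the 0.1 and 0.3 boundaries as inclusive upper limits is an assumption of this example.

```python
def stability_table(min_distances, high=0.1, some=0.3):
    """Count topics by how closely they reappear in a second set of topics.

    `min_distances` holds, for each topic, the cosine distance to its
    closest topic in the second set. Returns (count, percentage) per bin.
    """
    counts = {"high similarity": 0, "some similarity": 0, "no match": 0}
    for d in min_distances:
        if d <= high:       # assumed inclusive boundary
            counts["high similarity"] += 1
        elif d <= some:     # assumed inclusive boundary
            counts["some similarity"] += 1
        else:
            counts["no match"] += 1
    n = len(min_distances)
    return {label: (c, round(100 * c / n, 1)) for label, c in counts.items()}


# Toy minimum distances for ten topics (hypothetical data).
distances = [0.02, 0.05, 0.08, 0.15, 0.28, 0.60, 0.03, 0.20, 0.09, 0.45]
print(stability_table(distances))
```

Because the comparison is directional (each topic in the first set looks for its nearest topic in the second), running it in both directions can give different counts, as the paired rows of Table 2 show.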

8 Discussion of British grocery topics

In this section, we discuss the grocery topics from a major British grocery retailer resulting from the analysis in the previous sections. We present clustered topics of high recurrence obtained with the proposed methodology. We name and interpret topics using the top 15 products according to their conditional probability.

(a) Organic Food
(b) Italian Dish
(c) Roast Dinner
(d) Baby Goods
(e) Scottish Food
(f) Christmas Sweets
(g) Beer
(h) Meal Deal
(i) Biscuits
Figure 9: Topics in the UK grocery retail market baskets. Each topic is characterized by the 15 products with the largest probabilities. Probabilities and products are sorted in descending order. Brand names have been replaced by XXX for anonymization purposes. Stability shows the ratio between the number of topic distributions associated with each cluster and the number of posterior draws. Topics show a variety of shopping motivations, e.g., diet orientations, cooking from scratch, specific events, family composition, geography and seasonality. Topics may also be associated with alcohol/fat/salt/sugar consumption.

Figure 9 displays topics that exhibit a variety of shopping motivations, e.g., diet orientation, cooking from scratch, or a specific event. For instance, Figure 9(a) presents the organic foods topic, which is composed of organic products that belong to different product categories (produce, dairy, eggs, etc.). The organic topic, along with vegetarian-friendly foods and free-from lactose/gluten foods, indicates diet orientation. Figure 9(b) is an example of cooking from scratch, in this case an Italian dish. This topic and other topics such as Asian stir fry, Mexican or Indian curry not only show a preference for a specific type of cuisine but also express the shopping motivation of cooking at home. Figure 9(c) shows the roast dinner, a traditional British main meal that is typically served on Sunday. Other event-specific topics manifest customers’ motivations such as baking, having a picnic, buying a gift (flowers and chocolates), or having a party (spirits and ice cubes). In these examples, topics display products that together fulfill customers’ motivations. Identifying these products and their combinations has valuable commercial implications such as improving product recommendation, developing marketing campaigns, and optimizing assortments and shelf space.

Topics also reveal customer motivations that are driven by family composition, geography, and seasonality. For instance, Figure 9(d) displays products such as baby wipes, baby foods, whole milk and powdered milk, which are related to baby goods. Likewise, a topic with large-size products such as 6-pint bottled milk suggests a large household, and topics with ‘cat food’ and ‘dog food’ may indicate a pet within the household. Topics also exhibit specific shopping themes that are driven by products that are available or highly preferred in certain locations or at specific times of the year. For example, Figure 9(e) reveals the Scottish foods topic, which contains Scottish-branded products. Similarly, a Northern Irish topic includes packed and locally supplied foods. Festivities are also revealed by topics. For instance, Figure 9(f) shows the Christmas sweets topic, which is mainly characterized by mince pies and chocolate tubs. Likewise, Easter and Halloween are depicted by topics that contain their icons, chocolate eggs and pumpkins, respectively. Commercially speaking, identifying and understanding decision drivers aids further customer analysis, such as customer segmentation and customer profiling, and helps to optimize customer experience, build brand loyalty, and customize promotions by location or festivity.

Our approach allows us to provide measures of uncertainty for each inferred topic. For example, the organic food and Italian dish topics appeared 20 times in the 20 LDA realizations. The roast dinner motivation appeared 24 times in the 20 realizations, implying that, for some posterior samples, there were two separate topics that are associated with the roast dinner theme. Therefore, corresponding commercial decisions can be made with relative confidence in these shopping concepts. On the other hand, the baby goods theme only appeared 10 times and the Scottish food theme appeared 12 times in the 20 posterior samples, implying that these topics show much higher uncertainty and that they are not always represented in posterior samples.
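The recurrence counts quoted above translate directly into the stability ratio reported in Figure 9, i.e., the number of topic distributions in a cluster divided by the number of posterior draws. A small worked sketch, in which the function name `topic_recurrence` is hypothetical and the counts are those quoted in the text:

```python
def topic_recurrence(cluster_sizes, n_draws):
    """Ratio of clustered topic distributions to posterior draws, per topic."""
    return {name: size / n_draws for name, size in cluster_sizes.items()}


cluster_sizes = {
    "organic food": 20,
    "Italian dish": 20,
    "roast dinner": 24,
    "baby goods": 10,
    "Scottish food": 12,
}
for name, ratio in topic_recurrence(cluster_sizes, n_draws=20).items():
    print(f"{name}: {ratio:.2f}")
```

A ratio of 1 means the theme appears once in every draw, a ratio above 1 (roast dinner) means some draws split the theme into more than one topic, and a ratio well below 1 (baby goods, Scottish food) flags a theme that is absent from many draws.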

Understanding grocery consumption not only assists marketing practices but also opens up new avenues for social research. For example, dietary studies that link eating habits and people’s health (aiello2019large; Einsele2015ASA; wang2014fruit; wardle2007eating) are typically limited to survey data such as food frequency questionnaires and open-ended dietary assessment. Alternatively, uncovering consumption patterns related to alcohol/fat/sugar/salt through topic modeling is scalable and low-cost, and allows the identification of specific products and their characteristics. For example, the beer, meal deal and biscuits topics described in Figures 9(g), 9(h) and 9(i) can be further explored by analyzing their topical composition among baskets. Other topics of similar interest show processed meat, poultry, confectionery, crisps, snacks, wine, spirits, and sugary fizzy drinks. Similarly, topic models can help analyze eating behaviors, which help to show cultural and social changes. For example, topic model composition over time can reveal trends in attitudes to food such as healthy eating, budget meals and multi-cultural influences.

9 Conclusions

In this paper, we expanded the evaluation process of LDA to include qualitative aspects such as topic coherence, topic distinctiveness, and topic stability along with model generalization. In addition, we proposed a methodology that post-processes LDA models to summarize the entire posterior distribution of an LDA model into a single set of topical modes. Our approach identifies recurrent topics using meaningful distance criteria and allows the user to augment topic stability without sacrificing model generalization. The distance criteria were developed through a customized survey that we carried out with experts in the field of grocery retailing; the survey helped us set thresholds that assist the evaluation of topic coherence and topic similarity of grocery retail topics. We assessed the performance of LDA realizations and called attention to the weaknesses of automatically generated LDA topics in the domain of our application. Moreover, we empirically showed the advantages of the proposed methodology in terms of capturing topic uncertainty and enhancing coherence and stability. We identified stable and coherent topics that exhibit a variety of shopping motivations, e.g., diet orientations, cooking from scratch, specific events, family composition, geography, and seasonality. Topics can be associated with alcohol/fat/salt/sugar consumption, which may provide new avenues for sociological research. Finally, our methods focused on the context of LDA models. Summarizing multiple posterior realizations from a mixture model, however, is a challenge that extends beyond LDA. Our methods can be applied beyond LDA by replacing the cosine distance with other measures relevant to each context.