Product search refers to the user intent of accessing product-related information on the web. Many of these searches are geared toward online shopping or accessing e-commerce websites. A recent study (Manufacturing.Net) reports that a large share of individuals in the U.S. purchased at least one product online, with billions of dollars in total spending. According to various surveys (https://blog.survata.com/amazon-takes-49-percent-of-consumers-first-product-search-but-search-engines-rebound), Amazon.com accounts for about half of the product search queries on the web. However, traditional search engines remain popular among internet users when they do not have a specific purchase in mind: a larger share of such users start on search engines than on Amazon. Therefore, it is useful to characterize and better understand user intent for product search on the web.
Query understanding for general web search and user intent has been extensively studied in the past for domains like healthcare (inproceedings), developer behavior (bansal2019usage), employment (chancellor2018measuring) and security (bansal2020studying). One of the earliest works in this space is by Broder (article), who proposed a taxonomy categorizing web search into Navigational, Informational and Transactional queries. This work was followed up by Rose and Levinson (Rose04understandinguser), who created a framework for understanding user goals. In the e-commerce space, Moe et al. (articleMoe) performed a study to better understand online shoppers' behavior by classifying their intents into categories like buying, browsing, searching and knowledge-building. This work was followed by Su et al. (su2018user), who built a taxonomy of product-related search queries by conducting a controlled user study within an e-commerce website. There has also been some work on improving product search within e-commerce websites (long2012enhancing; guo2018multi), primarily with regard to purchase intent. In contrast, our work focuses on general web search queries with a broader set of applicable user intents.
In this work, we want to answer the following questions:
RQ1: What percentage of web search queries are related to product search?
RQ2: What are the different user intents for product search?
RQ3: How does product search intent correlate with search metrics like popularity, success and dwell time, and what are their underlying characteristics?
A study (li2008learning) performed a decade ago shows that approximately 5%-7% of distinct web search queries contain product intent. With RQ1 (discussed in Section 3.1), we want to validate whether that still holds in recent times, given the changes induced by e-commerce search engines. Traditional web search queries have intents like Navigational, Informational and Transactional (article). With RQ2 (discussed in Section 3), we investigate whether product search has similar or additional intents. With RQ3 (discussed in Section 4), we study the characteristics and distribution of the different product search intents.
Consequently, we analyze user search behavior from a random sample of 1 million search queries collected from the first week of September 2019 from a popular web search engine. Upon inspection, a large number of user queries turn out to be unrelated to products. Therefore, we develop a machine learning model to identify product-related queries. We then build a product search intent taxonomy following the open coding technique and human annotation used in prior work (jansen) for web search intent categorization. Finally, we develop an intent classification model to further classify the product-related queries into one of several intent types, leveraging various query log attributes as features. We apply our models to large-scale web search query logs to find the distribution of different intents and analyze how they correlate with several search metrics. We discover several interesting insights; for example, Transactional queries are the most popular, whereas Navigational queries have the highest rate of success (Section 4).
2. Product Query Classification
In this paper, we want to analyze and characterize how web search is used for product search. In order to do so, we first need to distinguish product search queries from general search queries. One possibility is to manually label a large number of queries to find names of different products and their variations. This is unlikely to yield many true positives, given that prior research (li2008learning) found only 5%-7% of distinct web search queries to be product queries. To alleviate this problem, we leverage various search engine signals to identify product-related queries, which can then be used to train machine learning models that recognize similar queries.
Distant supervision heuristics: We explore and evaluate the following heuristics to automatically collect a large number of positive examples of product-related queries to train the classifiers. Negative examples (non-product queries) are randomly sampled from the remaining queries.
Product Ads: Most major search engines, such as Google and Bing, allow advertisers to create Ad campaigns (GoogleShopping) for queries related to products. We label as product-specific those queries for which one or more product Ads were displayed. Note that this technique implicitly uses the Ad recommendation and ranking algorithm of the search engine.
Product Categories: We consider another heuristic that relies on the list of product categories from Amazon.com (AmazonList). These categories (such as automotive, books, etc.) cover the vast majority of products available therein. We label a query as product-related if the query text or the clicked url contains any of the product category names.
Product Ads and Categories: This uses a combination of the previous two heuristics.
Product list: Finally, we also use a list of the top 10 best-selling products of all time (https://time.com/92765/10-best-selling-products-ever/). We classify any query whose text or clicked url contains any of these products as a positive example.
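As an illustration, the combined "Product Ads and Categories" heuristic can be sketched as follows. The category list, helper names, and the choice of a conjunction for combining the two signals are our assumptions for this sketch; the paper does not spell out the exact implementation.

```python
# A few of the Amazon.com product categories mentioned in the text
# (illustrative subset, not the full list).
PRODUCT_CATEGORIES = {"automotive", "books", "electronics", "toys"}

def matches_category(text: str) -> bool:
    """True if any product category name appears in the text."""
    t = text.lower()
    return any(cat in t for cat in PRODUCT_CATEGORIES)

def label_query(query: str, click_url: str, has_product_ad: bool) -> int:
    """Distant-supervision label: 1 (product-related) if a product Ad was
    shown AND the query or clicked url mentions a product category, else 0.
    Treating the combination as a conjunction is our assumption."""
    return int(has_product_ad and
               (matches_category(query) or matches_category(click_url)))
```

Queries labeled positive this way form the training set; negatives are sampled from the rest.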
Dataset: We want to select the best heuristic from the above list to generate our training data. To that end, we randomly sampled search queries from the query logs and manually labeled them as product or non-product queries; 17.4% of these queries were product queries. Table 1 presents the evaluation of all the heuristics on this manually annotated query set. We observe that the Product Ads and Categories heuristic is the most accurate, with an F1-score of 71% for the product class and 91% accuracy overall. We then generate the training data by sampling queries that satisfy this heuristic as positive examples and an equal number of queries that do not as negative examples.
Product query classifier: The search query and clicked urls from the training dataset are tokenized, and pre-trained Word2Vec (Mikolov:2013:DRW:2999792.2999959) embeddings are used to generate distributed representations of these features. Both the query and the clicked urls are represented in a fixed-dimensional embedding space. Word embeddings have been shown to improve performance in many classification tasks (lilleberg2015support; tang2014learning). We train several classifiers and report five-fold cross-validation results in Table 2. Among all the classifiers, the Multi-Layer Perceptron obtains the best F1-score and is used for the remaining analysis.
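The distributed representation described above can be sketched as follows; the toy two-dimensional embedding table is an illustrative stand-in for a real pretrained Word2Vec model.

```python
import numpy as np

# Toy stand-in for a pretrained Word2Vec table; real vectors would be
# higher-dimensional and loaded from a trained model.
EMB = {
    "iphone": np.array([1.0, 0.0]),
    "price":  np.array([0.0, 1.0]),
}
DIM = 2

def embed(tokens):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# Query and clicked-url representations are concatenated into one
# feature vector fed to the classifier.
features = np.concatenate([embed("iphone price".split()),
                           embed("apple com iphone".split())])
```

In the paper's setup, this feature vector is what the Multi-Layer Perceptron and the other candidate classifiers are trained on.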
|Heuristic|Precision|Recall|F1|Accuracy|
|Ads & Categories|74|68|71|91|
3. Product Query Intent Taxonomy
Users can have different intents behind product-related searches. They might be looking for some information about a product, compare them to other similar items, make a purchase and so on. In this section, we study the various user intents for product search.
3.1. Dataset generation
We use the best-performing product query classification model (multi-layer perceptron) on a random sample of 1 million search queries, collected from the first week of September 2019, to obtain the set of product-related queries. On the manually labeled set of randomly sampled queries in Section 2, we found the proportion of product queries to be around 17.4%. Our findings indicate that the share of product search queries is significantly higher than the 5%-7% reported by a study (li2008learning) from a decade ago. For the next phase of analysis, we use the classified product-related queries along with other query features from the search engine, like the clicked urls, click snippets, click counts, etc., that provide more context about the search intent.
|Intent|Description|Examples|
|Not Product Related|Queries which are not product related.|'apple stock price', 'homedepot jobs', 'walmart customer relations address'|
|Comparison|Queries for comparing products in a category or based on attributes such as features, price, etc.|'skype vs microsoft teams', 'ironman action figures', 'compare iphone x and xr'|
|Informational|The user is looking for some specific product-related information.|'what does apple care cover on watches', 'sprinkler system design', 'where to buy battens?'|
|Navigational|The user wants to navigate to a specific website associated with a product.|'itunes sign in', 'amazon chime', 'apple care', 'myamazon/kdp', 'alexa apps'|
|Support|The user is looking for help to troubleshoot some product-specific issue.|'hp mouse not working', 'cannot connect to display mininet', 'access denied amazon aws'|
|Transactional|The user is interested in transactions like buying, downloading or installing a specific product.|'antique german wind up car', 'qvc gold jewelry clearance diamonique', 'h.h. scott S 10 speakers for sale'|
3.2. Clustering and sampling for annotation
Given the large number of queries from the previous step, we want to sample a set of queries for manual annotation. Random sampling risks missing queries from less popular or scarcely appearing intents. To explore queries from diverse intents, we leverage Latent Dirichlet Allocation (LDA) (blei2003latent) to cluster textually similar queries and sample queries from each cluster.
LDA assumes a document (a query in our case) to have a distribution over latent topics (intents in our case), and each topic to have a distribution over words. We use wordpiece tokenization (DBLP:journals/corr/WuSCLNMKCGMKSJL16) to tokenize the query string, clicked urls and click snippets; the resulting keywords are concatenated to form the input document for the LDA model. The clicked urls and snippets provide additional context to the query. The tokenization considers a fixed vocabulary size and segments words into wordpieces such that every piece is present in the vocabulary. A special symbol '##' is used to mark the split of a word into its sub-pieces. We remove the domain information from the feature set to prevent trivial clustering based on domain name. The length of the feature vector is given by the wordpiece vocabulary size. We fix the number of latent topics and randomly sample queries from each topic cluster to obtain the set used for manual annotation.
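A minimal sketch of the topic-based clustering step, using scikit-learn's LDA as a stand-in for whatever implementation the authors used; the document contents and the number of topics here are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" concatenates tokens from the query, its clicked urls
# (domains removed) and its click snippets (toy examples).
docs = [
    "iphone price apple buy",
    "hp mouse not working driver support",
    "iphone x review apple compare",
    "printer driver error fix support",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# The number of topics is a free parameter; the paper's value is not
# recoverable from this excerpt.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(X)       # (n_docs, n_topics), rows sum to 1
clusters = topic_dist.argmax(axis=1)    # hard cluster per query for sampling
```

Queries are then sampled from each cluster to cover less frequent intents in the annotation set.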
3.3. Manual annotation
Three annotators manually inspected all the queries collected in the previous step along with all the relevant contextual features. Each annotator was asked the following questions for each query:
Is the query related to a product?
For product-related queries, does any of the following intent categories apply: Informational, Navigational, Transactional?
Is there any other intent for product-related queries that is not included in the above intent categories?
The Informational, Navigational and Transactional intents are commonly observed in web search queries (article). In addition to those, the annotators observed (i) a large number of product queries comparing products based on attributes such as features and price, and (ii) a large number of support-related queries where the user is looking for help to troubleshoot some product-specific issue. Given these observations, the annotators were asked to re-annotate such queries with the intent categories Comparison and Support. We ignored queries in languages other than English. This resulted in five intent categories for product-related queries. Table 3 shows a detailed description of each intent category with examples and key indicators from our analysis. Introducing product-specific intents like Comparison and Support can help improve the quality of search results for product queries. We computed the inter-annotator agreement using Fleiss' kappa (fleiss1971measuring), which indicates substantial agreement among the annotators across intent categories. Note that the distribution of intents in Table 3 is computed over the manually annotated set of queries and does not reflect the true distribution of intents in the wild; an analysis of the true distribution is carried out in Section 4.
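For reference, Fleiss' kappa over an items-by-categories count table can be computed as follows; the annotation counts below are toy values, not the paper's data.

```python
import numpy as np

def fleiss_kappa(table):
    """Fleiss' kappa for an (items x categories) count table with a
    fixed number of raters per item."""
    table = np.asarray(table, dtype=float)
    n_raters = table.sum(axis=1)[0]
    # Per-item observed agreement
    p_i = (np.square(table).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions
    p_j = table.sum(axis=0) / table.sum()
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Rows = queries, columns = intent categories; entries = how many of the
# three annotators chose that category (toy counts).
table = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 0, 3],
]
kappa = fleiss_kappa(table)
```

Values above roughly 0.6 are conventionally read as substantial agreement.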
In Table 4, we show a comparison of user search intent taxonomies developed in prior work, including general web search (article; Rose04understandinguser), product search within e-commerce search engines (su2018user), consumer decision making for buying goods and services (BuyerDecisionProcess) and search in the context of interactive applications (Fourneyinproceedings). We observe significant overlap among the taxonomies of all these search intents. Our newly introduced product search intent categories, namely Comparison and Support, overlap with the evaluation of alternatives and post-purchase behavior stages of the buyer decision process (BuyerDecisionProcess).
|Work|Intent taxonomy|
|Broder et al. (article)|Informational, Navigational, Transactional|
|Buyer Decision Process (BuyerDecisionProcess)|Problem/Need Recognition, Information Search, Evaluation of Alternatives, Purchase Decision, Post Purchase Behavior|
|Su et al. (su2018user)|Target Finding, Decision Making, Exploration|
|Rose et al. (Rose04understandinguser)|Navigational, Informational (Directed, Undirected, Advice, Locate, List), Resource (Download, Entertainment, Interact, Obtain)|
|Fourney et al. (Fourneyinproceedings)|Operation Instruction, Troubleshooting, Reference, Download, General Information, Off-Topic|
|This work|Informational, Navigational, Transactional, Comparison, Support, Not-product-related|
3.4. Intent Classification
Given the set of manually annotated queries with product search intents, we aim to train machine learning models to automatically perform intent classification of user queries.
Features and preliminary analysis: We use the following query log features for intent classification. Figure 1 shows a distribution of these features over different intent categories in our labeled data.
Query embeddings: Average of the individual (wordpiece tokenized) word embeddings from pre-trained Word2Vec (Mikolov:2013:DRW:2999792.2999959) model.
Url embeddings: Similar to query embeddings, we tokenize and aggregate the word embeddings for the clicked urls.
Click Count: Number of urls the user has clicked on. From Figure 1, we observe that Transactional queries mostly have a single click depicting a crisp user objective. Navigational queries have one or two clicks where the user can easily recognize the website they want to navigate to from the suggestions. On the other hand, Support, Comparison and Informational queries tend to have a much higher click count since these intents are more exploratory in nature.
Query length: Number of tokens present in the query string. From Figure 1, we observe that Navigational queries are significantly shorter in length; whereas queries with Support intent are significantly longer as they tend to be more descriptive.
Snippet token count: Number of tokens present in the click snippets. From Figure 1, we observe that the number of snippet tokens is much higher for Support, Comparison and Informational queries, since the clicked websites have much higher textual content when compared to Navigational and Transactional queries.
Url domain count: Number of unique domains in the clicked urls. From Figure 1, we observe that the count is at most two for Navigational queries as the user knows which website they want to navigate to. The url domain count is much higher for Comparison queries as the user compares the product feature or price across multiple websites. Similarly, Informational and Support queries are more exploratory in nature.
Similarity: The fraction of unique query tokens that also appear in the clicked urls. This ratio is higher for Support queries, where the query and clicked urls are more descriptive of the issue the user is facing. Similarly, for Comparison queries most of the query tokens are contained in the clicked urls. Navigational and Transactional queries tend to be much shorter, resulting in a low ratio, and Informational queries are more exploratory in nature and also have a low ratio.
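The log-derived features above can be sketched as a single extraction function; the tokenization details and the exact similarity definition are our assumptions for this sketch.

```python
def query_features(query, clicked_urls, snippet):
    """Compute the scalar query-log features described above."""
    q_tokens = query.lower().split()
    # Tokenize clicked urls by splitting on separators (assumed scheme).
    url_tokens = set()
    for u in clicked_urls:
        url_tokens.update(u.lower().replace("/", " ").replace(".", " ").split())
    # Unique domains among the clicked urls.
    domains = {u.lower().split("/")[0] for u in clicked_urls}
    q_unique = set(q_tokens)
    return {
        "click_count": len(clicked_urls),
        "query_length": len(q_tokens),
        "snippet_token_count": len(snippet.split()),
        "url_domain_count": len(domains),
        # Fraction of unique query tokens appearing in the clicked urls.
        "similarity": (len(q_unique & url_tokens) / len(q_unique)
                       if q_unique else 0.0),
    }
```

These scalars are concatenated with the query and url embeddings to form the intent classifier's input.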
Model: For each intent, we split the labeled data into training and test sets and report five-fold cross-validation results. We train different classification models and evaluate them based on accuracy, precision, recall and F1-scores, as reported in Table 5. We observe that the Linear Support Vector Machine performs the best across all the classes, with the highest overall accuracy and F1-score.
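A minimal sketch of this setup with a linear SVM and five-fold cross-validation; the synthetic data stands in for the real feature matrix (embeddings plus log-derived features) and intent labels.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in: 300 "queries", 20 features, 3 of the intent classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

clf = LinearSVC()  # linear SVM, as selected in the paper
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
```

Macro-averaged F1 is one reasonable way to aggregate per-class scores; the paper reports per-class precision, recall and F1 alongside overall accuracy.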
4. Query Intent Analysis
We apply the best intent classification model (linear SVM) from the previous section to analyze the product queries (described in Section 3.1) based on several measures like popularity, success rate and effort estimation. We also perform a session-wise analysis to gain insights into which intents co-occur within a given session.
Intent success rate: A search is considered successful if the user spends more than 30 seconds on the clicked url page, following the study in (fox2005evaluating). Since a user may click on multiple urls during product search, we consider a search successful if the dwell time on the last clicked url exceeds 30 seconds. From Table 6, we observe that Navigational queries have the highest success rate at 77.28%, whereas Comparison queries are the least successful.
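The success criterion can be sketched directly from this definition; the function names are illustrative.

```python
def is_successful(dwell_times, threshold=30.0):
    """A search succeeds if the dwell time (seconds) on the LAST clicked
    url exceeds the 30-second threshold; no clicks means no success."""
    return bool(dwell_times) and dwell_times[-1] > threshold

def success_rate(searches):
    """Fraction of searches (each a list of per-click dwell times)
    that are successful."""
    return sum(is_successful(s) for s in searches) / len(searches)
```

The per-intent success rates in Table 6 are this fraction computed over all product queries of each intent.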
Intent popularity is calculated as the percentage of product queries having a specific intent out of all product queries. From Table 6, we observe that the most popular intent is Transactional, followed by Navigational, whereas Comparison and Support are the least popular.
Intent effort estimation: We estimate the effort taken to complete a search as proportional to the dwell time, and report values relative to the Comparison intent. From Table 6, we observe that the effort put into Comparison, Support and Transactional queries is much lower than that for Informational and Navigational queries. This results from the former intents (e.g., queries like 'skype vs microsoft teams' and 'iphone x price') being more task-specific compared to the latter, more exploratory intents (e.g., queries like 'sprinkler system design' and 'alexa apps').
|Intent|Success Rate|Popularity|Estimated Effort|
Intent co-occurrence: We analyzed the temporal co-occurrence of different intents within a session, aggregated across all sessions. From Figure 2, we observe that while only a small fraction of Comparison and Support queries are preceded by other intents, a notable fraction of Transactional queries are preceded by Comparison queries, indicating that many users indeed compare products before making a purchase. Comparison is also often followed by Informational queries, depicting that the user is looking for more information about a product after shortlisting a choice. Another interesting insight is that Support queries are commonly followed by Informational and Transactional queries, indicating that users seek additional information to help understand the solution (i.e., troubleshoot) or purchase a replacement for the product.
In this work, we performed an extensive study to characterize product search queries and their intents through query log analysis of the Bing web search engine. This contrasts with prior work that focused heavily on e-commerce search engines, with a limited diversity of product search intents. We developed a product query intent taxonomy for web search and trained machine learning models with a judicious selection of features from the search engine and query context to automatically identify product search intents. We applied our intent classification models to large-scale query logs and reported our findings from the intent distribution analysis.