Answer Ranking for Product-Related Questions via Multiple Semantic Relations Modeling

06/28/2020 ∙ by Wenxuan Zhang, et al. ∙ The Chinese University of Hong Kong

Many E-commerce sites now offer product-specific question answering platforms for users to communicate with each other by posting and answering questions during online shopping. However, the multiple answers provided by ordinary users usually vary diversely in their qualities and thus need to be appropriately ranked for each question to improve user satisfaction. It can be observed that product reviews usually provide useful information for a given question, and thus can assist the ranking process. In this paper, we investigate the answer ranking problem for product-related questions, with the relevant reviews treated as auxiliary information that can be exploited for facilitating the ranking. We propose an answer ranking model named MUSE which carefully models multiple semantic relations among the question, answers, and relevant reviews. Specifically, MUSE constructs a multi-semantic relation graph with the question, each answer, and each review snippet as nodes. Then a customized graph convolutional neural network is designed for explicitly modeling the semantic relevance between the question and answers, the content consistency among answers, and the textual entailment between answers and reviews. Extensive experiments on real-world E-commerce datasets across three product categories show that our proposed model achieves superior performance on the concerned answer ranking task.




1. Introduction

Question:   Will these work with Android?
Relevant Review Snippets:
- I have a Samsung Galaxy Note 4, I absolutely love these headphones, totally worth what I paid for them.
- Easy to sync with an iPhone/Android phones.
- Waste of money if using with an android device
- I use mine with an android device and still works great.
- ……
A1: The will and despite what some had said, they will sound exactly as they do on an Apple device.
A2: Yes they will work with any Bluetooth capable device.
A3: They will work but they won’t sound as proficient as it would with an apple product.
A4: I have Samsung Note 5, the headphones cannot connect via Bluetooth.
A5: Its really not worth the money when using with an android.
Table 1. An example question of a headphone product from Amazon accompanied with its multiple answers. There are also some relevant review snippets of the question.

Appropriately addressing users’ concerns during online shopping can greatly improve their shopping experience and stimulate purchase decisions. To this end, many E-commerce sites such as Amazon and eBay now offer product-specific community question answering platforms where users can post questions and answer existing ones. Thanks to the convenience of such platforms, a question can typically get multiple user-provided answers. However, these answers, like other user-generated content, vary widely in quality (Suryanto et al., 2009; Jeon et al., 2006) and suffer from typical flaws such as spam or even malicious content from competitors.

Table 1 presents an example question, as well as its user-provided answers ranked by the community votes. It can be observed that even for this relatively objective question, the answers vary considerably. Such variation in answer content and quality motivates the task of automatically ranking these answers to improve user satisfaction. As shown in the example, there usually exist some relevant product reviews for the concerned question, which can provide useful information for ranking the answers if they are effectively utilized. Thus in this paper, we aim to tackle the task of answer ranking for product-related questions, with the associated product reviews treated as auxiliary information that can be exploited to assist the ranking.

Answer selection methods have been extensively studied for tackling the answer ranking problem in retrieval-based question answering (QA) systems (Tan et al., 2016; Nakov et al., 2017; Wang et al., 2007; Eric et al., 2017). Most existing works focus on measuring the semantic relevance between the question and a candidate answer, where the negative answers for training are usually randomly sampled from the whole answer pool (Qiu and Huang, 2015; Tay et al., 2017; Deng et al., 2019) or chosen from irrelevant documents (Yang et al., 2015). However, as can be observed from the above example, merely measuring the semantic relevance between the question and answer texts is no longer sufficient for answer ranking in E-commerce settings, since all of the answers are written specifically for the question and thus most of them are topically relevant to it. Moreover, existing general answer selection models cannot make use of the relevant product reviews during the ranking process. Recently, Zhang et al. (2020) attempted to utilize reviews for identifying helpful answers in E-commerce. However, they ignore the relations among answers and reviews, which are essential for the concerned answer ranking problem. On the other hand, some product-related question answering (PQA) methods explore the utilization of product reviews to provide responses for a given question (Yu and Lam, 2018; Chen et al., 2019b, c; Gao et al., 2019), where selected reviews either serve directly as the response (Yu et al., 2012; Chen et al., 2019b) or are used to generate a response sentence with a sequence-to-sequence neural model (Chen et al., 2019c; Gao et al., 2019). These PQA methods assume that user-written answers are lacking and thus turn to product reviews for help when addressing the given question. Such an assumption neglects the large number of available answers provided by former buyers (Yu et al., 2018; Carmel et al., 2018), which can better answer the given question than the responses produced from the reviews.

In E-commerce settings, a key issue for the concerned ranking task is what makes an answer a good one and how we can characterize such good answers by exploiting the associated reviews. Specifically, appropriately ranking the user-provided answers requires modeling the complex semantic relations among the existing information sources, i.e., the question, answers, and product reviews. We argue that three kinds of semantic relations are of great importance: (i) Firstly, similar to the general answer ranking problem (Rao et al., 2019), the semantic relevance of the answer content to the question is still essential for determining the ranking. At the same time, it is also necessary to consider the relevance between the question and reviews, which can alleviate the noise from irrelevant review information. (ii) Secondly, the textual similarity between each pair of answers indicates their content consistency, which can be regarded as a notion of peer review among the entire answer set for verifying the reliability of each answer (Tymoshenko and Moschitti, 2018). Returning to the above example, since more answers agree that “the headphone is compatible with Android devices”, an answer with the opposite opinion, such as A4, is less reliable than the others. Similarly, measuring the similarity between reviews also helps cross-verify the content consistency among them and hence captures the crowd’s common opinions reflected in the review set for the given question. (iii) Thirdly, one may notice that such common opinions from the reviews often reveal the authentic and general judgement of the whole community. Thus, the relationship between an answer and the reviews can be modeled by the textual entailment relation (Bowman et al., 2015; Chen et al., 2017), which examines whether the opinion held by an answer is coherent with the common opinions reflected in the reviews. As shown in Table 1, the review snippets reveal that “the concerned headphone can work with Android”, indicating that the first three answers, i.e., A1, A2 and A3, are more consistent with the opinions of the whole community, which leads to a higher rank. In summary, the main challenge is how to model and utilize these multiple semantic relations among the question, answers, and relevant reviews for ranking the answers.

In this paper, we propose an answer ranking model named MUSE for product-related questions, which comprehensively models multiple semantic relations among the question, answers, and relevant reviews. Concretely, we first apply a word-to-word attention mechanism from the question to each individual answer during the answer encoding phase, so that the important information in the answer text and its relevance to the question are highlighted in the textual features of each answer. Next, we construct a multi-semantic relation graph with the question, each answer and each review snippet as nodes. Then a customized graph convolutional neural network is designed for explicitly modeling the interrelationship between the nodes under multiple semantic relations. Precisely, for a specific answer, the textual features obtained in the earlier step are further refined by considering the semantic relevance with the question, the textual similarity with other answers and the textual entailment relation with relevant reviews. By modeling the relations of a given answer with other answers and relevant reviews, the coherence information between the concerned answer and the common opinion is accumulated to assist the ranking. Finally, we adopt a joint loss function, combining both pointwise and listwise learning approaches, to consider a specific answer both locally and globally in the entire answer set. To summarize, the contributions of this paper are as follows:

  • We investigate the problem of ranking user-provided answers in E-commerce and propose a framework to jointly model multiple semantic relations including the semantic relevance between the question and answers, the textual similarity among answers, and the textual entailment between answers and reviews.

  • We model both textual and interaction features of each answer to facilitate the ranking task. Importantly, a novel graph convolutional operation is designed to integrate the coherence information under different semantic relations.

  • Experimental results on real-world E-commerce data across three product domains show that our proposed MUSE model achieves superior performance on the concerned task.

2. Related Work

Answer Ranking.   Answer selection has been extensively studied for solving the answer ranking problem in retrieval-based question answering systems, as exemplified by community question answering (CQA) (Nakov et al., 2017) and factoid question answering (Wang et al., 2007). The research on answer ranking has evolved from early information retrieval (IR) research, which primarily focused on feature engineering with syntactic or lexical approaches (Wang et al., 2007; Severyn and Moschitti, 2013). In recent years, deep learning based answer selection methods have made several breakthroughs and become the mainstream approach to the answer ranking task. Most of the existing neural models adopt a Siamese architecture (Severyn and Moschitti, 2015; Rao et al., 2017), an attentive architecture (Tan et al., 2016) or a compare-aggregate architecture (Wang et al., 2017; Wang and Jiang, 2017; Yang et al., 2019) for modeling the semantic relevance between the question and answer without heavy feature engineering. Additionally, some recent studies also learn to rank question-answer pairs from different perspectives such as utilizing external knowledge (Deng et al., 2018), extracting length-adaptive features (Shao et al., 2019), modeling user expertise (Lyu et al., 2019), and measuring answer novelty (Omari et al., 2016; Harel et al., 2019).

There are also some works that measure the quality of answers or similar text content, so that the predicted quality can be used to rank them. Shah and Pomerantz (2010) evaluate and predict answer quality on CQA platforms. Halder et al. (2019) propose a neural model to predict the quality of a response post to the original post, with awareness of several previous posts in the discussion forum. In the E-commerce scenario, some studies (Fan et al., 2019; Chen et al., 2019a) utilize product information such as product titles to predict the quality of a customer review. Recently, Zhang et al. (2020) utilize user reviews for identifying helpful answers in product question answering forums, but they neglect the interrelationships among answers. In this work, we focus on the answer ranking problem in E-commerce settings, where we wish to rank multiple user-provided answers for a product-related question with the help of relevant reviews.

Product-related Question Answering.   Recent years have witnessed several successful applications to the product-related question answering (PQA) problem. Most of the existing studies exploit customer reviews as major (Yu et al., 2012; Yu and Lam, 2018; Chen et al., 2019b) or auxiliary resources (McAuley and Yang, 2016; Wan and McAuley, 2016; Chen et al., 2019c; Gao et al., 2019) for providing responses to the given question. McAuley and Yang (2016) divide questions into yes/no questions and open-ended questions and then tackle the yes/no type as a classification task, aiming to predict the answer as “yes”, “no” or “unsure” with the help of reviews. Following this direction, Yu and Lam (2018) further consider the latent aspect information to improve the answer prediction performance. Some other works (Yu et al., 2012; Chen et al., 2019b) adopt retrieval-based methods to retrieve certain review snippets serving as the response. For example, Chen et al. (2019b) propose a multi-task learning framework to identify reviews for a given question. Recently, some studies utilize the reviews to generate a sentence as the response based on the sequence-to-sequence architecture (Chen et al., 2019c; Gao et al., 2019). Although most of these models utilize the review information, they often assume that user-written answers are unavailable, which differs from our task of directly ranking these answers for the given question.

Graph Neural Networks.   Graph Neural Networks (GNNs) (Kipf and Welling, 2017) have been widely adopted to model graph-structured data. Some recent studies exploit GNNs in IR-related tasks, constructing text-based graphs to model structural relations beyond the context itself. Li et al. (2019) propose a large-scale anti-spam method based on GCN for detecting spam advertisements. Sun et al. (2019) propose a GCN encoder for keyphrase extraction that can effectively capture document-level word salience. Chen et al. (2019d) develop heterogeneous graph attention networks (HGAT) for user profiling. In this work, we explore the utilization of relational GNN (Schlichtkrull et al., 2018) to model the interactions between information from different sources under different semantic relations in E-commerce settings.

3. Model

Figure 1. The architecture of the MUSE model composed of three main components, namely textual feature modeling, multiple semantic relations modeling, and answer ranking. Three types of semantic relations are specifically considered: the semantic relevance of the question to the answers and to the reviews, the textual similarity among answers and among reviews, and the textual entailment between answers and reviews.

In typical E-commerce settings, given a product-related question $q$ of a particular product, its answer set $A = \{a_1, \dots, a_n\}$ contains $n$ human-written answers. We can also obtain $m$ relevant review snippets $R = \{r_1, \dots, r_m\}$ for the question $q$. Our goal is to rank the answers in the answer set, with the review snippets treated as auxiliary information that can be exploited to assist the ranking.

3.1. Model Overview

In this section, we introduce our proposed answer ranking model, MUSE, with modeling of multiple semantic relations among the question, answers, and relevant reviews. As shown in Figure 1, MUSE consists of three main components: textual feature modeling, multiple semantic relations modeling, and answer ranking.

We first employ a word-level attention from the question to attend to the important and relevant information in the answers during the textual feature modeling. Then a multi-semantic relation graph is constructed to model the multiple semantic relations among those texts from diverse information sources. Correspondingly, a customized graph convolutional network is developed to obtain the interaction-based features for each answer by aggregating the semantic relevance information between the question and answers, the textual similarity information among different answers, and the textual entailment information between each answer and each review snippet. Finally, after obtaining the textual features and interaction features, we design a joint loss function combining pointwise and listwise learning approaches to rank multiple answers.

3.2. Textual Feature Modeling

3.2.1. Context Modeling

Given a text sequence, which can be either the question sentence $q$, an answer sentence $a_i$, or a review snippet $r_j$, we first map each word in the sequence to a $d_e$-dimensional dense vector, which can be initialized with pre-trained word vectors. We denote the embedding of the word $w_t$ as $e_t$. To model the context interactions among words in the sequence, a bi-directional LSTM encoder is then employed to transform each word into a context-aware vector representation:

$$h_t = \text{Bi-LSTM}(e_t, h_{t-1})$$

where $h_t$ is the hidden state of the Bi-LSTM encoder at the $t$-th time step. We thus denote the representation of the question, the $i$-th answer and the $j$-th review snippet after such context-aware encoding as $H^q$, $H^{a_i}$ and $H^{r_j}$ respectively:

$$H^q = [h^q_1, \dots, h^q_{l_q}] \in \mathbb{R}^{l_q \times d_h}, \quad H^{a_i} = [h^{a_i}_1, \dots, h^{a_i}_{l_a}] \in \mathbb{R}^{l_a \times d_h}, \quad H^{r_j} = [h^{r_j}_1, \dots, h^{r_j}_{l_r}] \in \mathbb{R}^{l_r \times d_h}$$

where $l_q$, $l_a$ and $l_r$ denote the sequence lengths of the corresponding text sequences, and $d_h$ is the dimension of the hidden state of the Bi-LSTM encoder. To avoid notational clutter, we will omit the index of the answer and review snippet in this section since the same operations are conducted for each answer and review respectively. For example, we will use $H^a$ to represent the context-aware representation for one particular answer instead of $H^{a_i}$.

3.2.2. Question-attended Answer Encoding

To explicitly highlight the core semantic units in the answer sentence and their relevance to the question, we employ a word-to-word attention mechanism that attends the important information in the question to each word of the answer. Specifically, for the $t$-th word in $H^a$, we consider its similarity with every single word in the question:

$$s_t = H^q h^a_t, \qquad \alpha_{t,k} = \frac{\exp(s_{t,k})}{\sum_{k'=1}^{l_q} \exp(s_{t,k'})}$$

where $s_t \in \mathbb{R}^{l_q}$ denotes the similarity scores of the $t$-th word in the answer with every word in the question, and $\alpha_{t,k}$ is the normalized importance weight between the $t$-th word in the answer and the $k$-th word in the question. Then for each word in the answer, we obtain a question-attended representation as a weighted sum of the embeddings of question words. Next we concatenate the context-aware answer representation and the question-attended answer representation for the $t$-th word to obtain an enriched answer representation:

$$\tilde{h}^a_t = \sum_{k=1}^{l_q} \alpha_{t,k}\, h^q_k, \qquad m^a_t = W_m\,[h^a_t ; \tilde{h}^a_t] + b_m$$

where $W_m$ and $b_m$ are trainable parameters, and $[;]$ denotes the concatenation operation. A max-pooling operation is then employed to obtain the encoded vector representations $x^a$ and $x^q$ for the answer and the question respectively:

$$x^a = \text{max-pooling}([m^a_1, \dots, m^a_{l_a}]), \qquad x^q = \text{max-pooling}(H^q)$$

We denote the textual features obtained from the above operations for the $i$-th answer as $x^{a_i}$.
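The question-attended encoding described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the dot-product similarity, the toy two-dimensional states, and the helper names `question_attended` and `max_pool` are assumptions made only for illustration, and the trainable projection applied after concatenation is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def question_attended(answer_states, question_states):
    """Return, per answer word, the attention weights and the attended vector."""
    dim = len(question_states[0])
    results = []
    for h_a in answer_states:
        # Assumption: dot-product similarity between answer and question words.
        weights = softmax([dot(h_a, h_q) for h_q in question_states])
        attended = [
            sum(w * h_q[d] for w, h_q in zip(weights, question_states))
            for d in range(dim)
        ]
        results.append((weights, attended))
    return results


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


question = [[1.0, 0.0], [0.0, 1.0]]  # toy states for two question words
answer = [[2.0, 0.0]]                # one answer word, aligned with question word 0
weights, attended = question_attended(answer, question)[0]
enriched = answer[0] + attended      # list concatenation, i.e. [h ; h~]
```

In this toy case the answer word puts most of its attention on the first question word, and the enriched vector has twice the hidden dimension because of the concatenation.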

3.2.3. Clip-Rescale Attention for Review Encoding

For the relevant reviews, we also obtain encoded review representations as auxiliary information for assisting the ranking of answers. Although the review snippets in $R$ are typically obtained with an initial retrieval process, they still contain much noise, since these product reviews are originally written without explicitly responding to any question. To prevent the irrelevant information from distracting the encoding, we employ a more aggressive clip-and-rescale attention mechanism inspired by Bian et al. (2017) to obtain the question-attentive representation for each review snippet:


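$$\alpha = \text{softmax}\big((H^r W_r + b_r)\, x^q\big), \qquad \tilde{\alpha} = \text{Rescale}(\alpha \odot m)$$

where $W_r$ and $b_r$ are trainable parameters, $H^r$ denotes the context-aware representation for a review snippet $r$, $\alpha$ contains the original attention weights for each word in the review, $m$ denotes a mask vector for $\alpha$ in which only the indices whose corresponding weight scores are among the top $k$ in $\alpha$ are $1$, and $0$ otherwise, and $\odot$ denotes the element-wise vector multiplication. $\text{Rescale}(\cdot)$ refers to the vector rescale operation: $\text{Rescale}(x) = x / \sum_t x_t$. Hence $\tilde{\alpha}_t$ refers to the importance score of the $t$-th word in the review snippet and is forced to be 0 for unimportant words. Then we compute the review representation as the weighted sum of its context-aware representation:

$$x^r = \sum_{t=1}^{l_r} \tilde{\alpha}_t\, h^r_t$$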


The same operations introduced above are conducted for every review snippet. We thus denote the vector representation after such textual feature modeling for the $j$-th review as $x^{r_j}$. Following a similar notation convention, we denote the question representation as $x^q$.
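The clip-and-rescale step can be sketched as follows, assuming the attention weights have already been computed and are non-negative. The function names and the toy weights are hypothetical; only the top-k masking and the sum-to-one rescaling follow the description above.

```python
def clip_and_rescale(weights, k):
    """Keep the k largest attention weights, zero the rest, rescale to sum to 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    keep = set(top)
    clipped = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(clipped)
    if total == 0:
        return clipped  # nothing to rescale when every kept weight is zero
    return [w / total for w in clipped]


def weighted_sum(weights, states):
    """Weighted sum of word states, giving one vector per review snippet."""
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]


alpha = [0.1, 0.4, 0.2, 0.3]          # toy attention weights over four review words
sparse = clip_and_rescale(alpha, k=2)  # only the two strongest words survive
```

With `k=2`, the words weighted 0.4 and 0.3 are kept and renormalized, while the other two contribute nothing to the review representation.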

3.3. Multiple Semantic Relations Modeling

3.3.1. Multiple Semantic Relations

Effectively ranking the answers requires exploiting the rich semantic relations among the question, answers, and reviews. To this end, we identify three types of semantic relations among these diverse information sources which are useful for ranking the user-provided answers:

(1) Semantic Relevance between the question and answer text is typically exploited in the general answer ranking task (Rao et al., 2019). After bringing in review information, measuring the semantic relevance between the question and review snippets is also useful for alleviating the noise from irrelevant review information.

(2) Textual Similarity between each pair of answers can effectively measure their content consistency (Tymoshenko and Moschitti, 2018) and hence help identify the core opinions in the entire answer set for the given question. Similarly, considering such relation among review snippets in the review set can reveal the common opinions reflected in the reviews.

(3) Textual Entailment relation between an answer and a review snippet indicates whether the answer is supported by that specific review, which is inspired by some attempts of utilizing textual entailment relation for general question answering problem (Harabagiu and Hickl, 2006; Trivedi et al., 2019). Concretely, we treat the review as external evidence to examine the opinion coherence of a given answer with the common opinions of the community.

Modeling these different semantic relations, especially capturing the coherence information of an answer with other answers and user reviews, requires aggregating the complex interactions among the existing information sources. Importantly, we can observe that these relations are closely connected and should be modeled concurrently when ranking the answers. For example, each answer needs to be considered with different purposes when measuring its relation with the question, other answers, and review snippets. The coherence information from one relation can also affect its interaction with other information sources under different relations. Therefore, we propose a multi-semantic relation graph and utilize graph convolutional networks (GCN) (Kipf and Welling, 2017), which are shown to excel at aggregating the structural information from the neighborhoods, to capture the coherence information under different semantic relations.

3.3.2. Graph Construction

Formally, we denote an undirected graph with multiple semantic relations as $G = (V, E)$, with nodes $v_i \in V$ and labeled edges (i.e., semantic relations) between nodes $v_i$ and $v_j$ as $(v_i, \tau, v_j) \in E$, where $\tau$ is the relation type between two nodes. Then to construct the graph, we treat the question $q$, each answer sentence $a_i$ and each review snippet $r_j$ as a node in $G$. The total number of nodes is thus $1 + n + m$. We initialize each node with its corresponding textual features obtained from the textual feature modeling, which encode its core semantic information.

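To represent the multiple semantic relations, we make use of different adjacency matrices for the graph $G$. Specifically, the relation type between two nodes is $\tau \in \{rel, sim, ent\}$, representing the semantic relevance, textual similarity, and textual entailment relations respectively. Three adjacency matrices can thus be constructed for $G$:

$$A^{rel}_{ij} = \begin{cases} 1 & \text{if one of } v_i, v_j \text{ is the question and the other is an answer or a review} \\ 0 & \text{otherwise} \end{cases}$$

$$A^{sim}_{ij} = \begin{cases} 1 & \text{if } v_i, v_j \text{ are both answers or both reviews} \\ 0 & \text{otherwise} \end{cases} \qquad A^{ent}_{ij} = \begin{cases} 1 & \text{if one of } v_i, v_j \text{ is an answer and the other is a review} \\ 0 & \text{otherwise} \end{cases}$$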
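A minimal sketch of the graph construction follows, assuming binary edges with no self-loops and the node order question, answers, reviews. The relation names and the helper `build_adjacency` are illustrative choices rather than the paper's notation.

```python
def build_adjacency(n_answers, n_reviews):
    """Build one binary adjacency matrix per semantic relation.

    Node order (an assumption of this sketch): index 0 is the question,
    followed by the answers, followed by the review snippets.
    """
    kinds = ["q"] + ["a"] * n_answers + ["r"] * n_reviews
    size = len(kinds)
    relations = ("relevance", "similarity", "entailment")
    adj = {rel: [[0] * size for _ in range(size)] for rel in relations}
    for i in range(size):
        for j in range(size):
            if i == j:
                continue  # self-connections are handled separately by the layer
            if "q" in (kinds[i], kinds[j]):
                adj["relevance"][i][j] = 1    # question with answer or review
            elif kinds[i] == kinds[j]:
                adj["similarity"][i][j] = 1   # answer-answer or review-review
            else:
                adj["entailment"][i][j] = 1   # answer with review
    return adj


graph = build_adjacency(n_answers=2, n_reviews=2)  # nodes: q, a1, a2, r1, r2
```

Each off-diagonal node pair falls under exactly one relation, so the three matrices partition the edges of the complete graph over the question, answers, and reviews.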


3.3.3. Coherence Information Aggregation

Motivated by the Relational GCN (Schlichtkrull et al., 2018), which shows good performance when considering the multiple relations between entities in a knowledge graph for the link prediction task, we develop a novel architecture for modeling the multiple semantic relations among the question, answers, and reviews for the concerned task. For a node $v_i$, the opinion coherence information is aggregated from its neighboring nodes:

$$h_i^{(l+1)} = \sigma\Big(\sum_{\tau} \sum_{j \in N_i^{\tau}} \frac{1}{c_{i,\tau}} W_{\tau}^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\Big) \qquad (16)$$

where $h_i^{(l)}$ is the hidden state of the node $v_i$ at the $l$-th layer of the network, $\sigma(\cdot)$ is an activation function, $N_i^{\tau}$ denotes the neighboring indices of the node $v_i$ under the relation $\tau$, and $W_{\tau}^{(l)}$ and $W_0^{(l)}$ are trainable parameters representing the transformation from neighboring nodes and from the node itself. $c_{i,\tau}$ is a normalization constant such as $|N_i^{\tau}|$ in (Schlichtkrull et al., 2018). To avoid changes in the scale of the feature representation, which are commonly observed to be harmful for the performance, we apply a symmetric normalization transformation:

$$\hat{A}^{\tau} = (D^{\tau})^{-\frac{1}{2}} A^{\tau} (D^{\tau})^{-\frac{1}{2}}$$

where $A^{\tau}$ is the adjacency matrix under the relation $\tau$, and $D^{\tau}$ is the corresponding degree matrix of $A^{\tau}$.

Unlike the basic GCNs using one convolutional filter matrix to model the feature transformation, the aggregation operation in Equation (16) employs different weight matrices for different semantic relations in each layer, which can capture the coherence information explicitly under different relations. Besides, a self-connection weight matrix is utilized to control how much information in the node itself at each update should be kept.

At each step, the representations of the answer nodes are enriched by their neighbourhoods. Specifically, the first layer takes the textual feature vectors obtained from Section 3.2 as the input for each node, i.e., $h^{(0)} = x^q$ for the question node, $h^{(0)} = x^{a_i}$ for the node of the $i$-th answer, and $h^{(0)} = x^{r_j}$ for the node of the $j$-th review snippet. Then the transformation in Equation (16) can be stacked up to $L$ layers to include the dependencies across multiple relational steps. We take the output of each answer node at the last layer as its interaction features and denote $z^{a_i}$ as the feature representation for the $i$-th answer.
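To make the aggregation concrete, the sketch below implements one relational graph convolution step with symmetric normalization in plain Python. It is a simplified illustration under stated assumptions, not the authors' code: adjacency matrices are symmetric and binary, the activation is ReLU, weights are given as nested lists, and the names `sym_normalize` and `rgcn_layer` are hypothetical.

```python
import math


def sym_normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of a symmetric adjacency matrix."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def matvec(weight, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def rgcn_layer(hidden, adjs, rel_weights, self_weight):
    """One layer: per-relation neighbor messages plus a self-connection, then ReLU."""
    normalized = {rel: sym_normalize(a) for rel, a in adjs.items()}
    updated = []
    for i, h_i in enumerate(hidden):
        acc = matvec(self_weight, h_i)  # contribution of the node itself
        for rel, a_hat in normalized.items():
            for j, h_j in enumerate(hidden):
                if a_hat[i][j]:
                    message = matvec(rel_weights[rel], h_j)
                    acc = [x + a_hat[i][j] * m for x, m in zip(acc, message)]
        updated.append([max(0.0, x) for x in acc])  # ReLU is an assumption
    return updated


identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 1.0]]                      # two toy nodes
adjs = {"similarity": [[0, 1], [1, 0]]}                # one relation linking them
out = rgcn_layer(hidden, adjs, {"similarity": identity}, identity)
```

With identity weights, each of the two connected nodes ends up with the sum of its own state and its neighbor's state, which shows how coherence information flows along an edge. Stacking the function several times corresponds to using multiple layers.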

3.4. Answer Ranking

For each answer $a_i$, after obtaining the textual features $x^{a_i}$ and the interaction features $z^{a_i}$, they are concatenated and fed to an MLP with one hidden layer to get the final prediction scores:

$$o_i = \text{MLP}([x^{a_i} ; z^{a_i}])$$

where $o_i$ is the prediction vector for the $i$-th answer. We denote the concatenation of the prediction vectors of the entire answer set as $O$. Finally, we employ a joint loss function, combining the pointwise and listwise learning approaches, to conduct training for learning to rank the answers.

3.4.1. Pointwise Loss Function

One of the most commonly used training strategies for the answer ranking problem is the pointwise learning approach. Specifically, for each answer $a_i$, a softmax function is applied to its prediction vector $o_i$ to obtain the predicted distribution $\hat{y}_i$. Then each answer is considered separately and the cross-entropy loss is computed:

$$\mathcal{L}_{point} = -\frac{1}{n} \sum_{i=1}^{n} y_i^{\top} \log \hat{y}_i \qquad (19)$$

where $y_i$ denotes the one-hot encoding of the label for the $i$-th answer. The total loss for each question is thus the average of the cross-entropy loss of all its answers.
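The pointwise objective can be sketched as below, assuming each answer has a two-dimensional prediction vector ordered as (negative, positive) and an integer label in {0, 1}. The function name `pointwise_loss` and the toy inputs are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointwise_loss(prediction_vectors, labels):
    """Average cross-entropy over all answers of one question."""
    losses = []
    for scores, label in zip(prediction_vectors, labels):
        probs = softmax(scores)
        # With a one-hot target, cross-entropy reduces to -log p(true class).
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


uninformed = pointwise_loss([[0.0, 0.0]], [1])               # uniform prediction
confident = pointwise_loss([[0.0, 4.0], [3.0, 0.0]], [1, 0])  # mostly correct
```

A uniform prediction costs exactly ln 2 per answer, and the loss shrinks as the predicted distribution concentrates on the correct label.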

3.4.2. Listwise Loss Function

Another choice of learning approach is the listwise method, which considers all candidate answers to the given question at the same time. Given the prediction matrix $O$, we first normalize the prediction scores among all answers:

$$\hat{s}_i = \frac{s_i}{\lVert s \rVert_p}$$

where $s_i$ is the unnormalized prediction score of answer $a_i$ being a positive answer, and $p$ is the order of the vector norm, which is fixed in the experiments. Similarly, we also normalize the whole label list of all answers with $\hat{c}_i = c_i / \lVert c \rVert_p$, where $c_i$ is the raw label for the $i$-th answer. Then we can compute the listwise loss for a given question with the Kullback-Leibler divergence:

$$\mathcal{L}_{list} = \text{KL}(\hat{c} \,\Vert\, \hat{s}) = \sum_{i=1}^{n} \hat{c}_i \log \frac{\hat{c}_i}{\hat{s}_i} \qquad (21)$$


3.4.3. Joint Loss Function

The pointwise and listwise learning approaches consider a specific answer locally and globally in the entire answer set respectively. In the pointwise learning approach, we focus on each individual answer locally, and the goal is to accurately predict their corresponding labels. In the listwise method, we examine the entire answer list globally, attempting to differentiate the good and bad answers, and making the former rank higher. In order to combine the strengths of both of them, we propose to employ a joint loss function in this work.

Specifically, we combine these two types of loss functions into a joint loss to train our proposed MUSE model. The two loss functions introduced above in Equation (19) and Equation (21) are for one single question (i.e., one data instance) in the dataset. Then for the whole dataset with $N$ questions in total, the joint loss function is:

$$\mathcal{L} = \sum_{k=1}^{N} \Big( \mathcal{L}^{(k)}_{point} + \lambda\, \mathcal{L}^{(k)}_{list} \Big) + \mu \lVert \Theta \rVert_2^2$$

where $\mathcal{L}^{(k)}_{point}$ and $\mathcal{L}^{(k)}_{list}$ are the pointwise and listwise loss functions for the $k$-th question respectively, $\lambda$ is a hyper-parameter for balancing between these two loss functions, $\mu$ is the L2 regularizer weight, and $\Theta$ is the set of all trainable parameters in the model.
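The listwise term and its combination with a pointwise term can be sketched as follows. Several choices here are assumptions made for illustration: scores are taken to be non-negative (for example positive-class probabilities), both lists are normalized with the L1 norm, the divergence is computed from the labels to the predictions, and the balancing weight multiplies the listwise term. The per-question pointwise losses are passed in as plain numbers, and the L2 regularization term is left out.

```python
import math


def l1_normalize(values):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(abs(v) for v in values)
    if total == 0:
        return [0.0 for _ in values]
    return [v / total for v in values]


def listwise_loss(scores, labels, eps=1e-12):
    """KL divergence between normalized labels and normalized scores."""
    target = l1_normalize(labels)
    predicted = l1_normalize(scores)
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0  # terms with zero target mass contribute nothing
    )


def joint_loss(pointwise_losses, listwise_losses, balance):
    """Sum over questions of pointwise + balance * listwise (regularizer omitted)."""
    return sum(
        point + balance * lst
        for point, lst in zip(pointwise_losses, listwise_losses)
    )


good = listwise_loss([0.9, 0.1], [1, 0])  # the positive answer is scored highest
bad = listwise_loss([0.1, 0.9], [1, 0])   # the positive answer is scored lowest
total = joint_loss([0.5, 0.7], [good, bad], balance=0.5)
```

The listwise term is small when the score mass sits on the positively labeled answers and grows when it sits elsewhere, which is the global view that complements the per-answer pointwise term.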

4. Experiment Setup

4.1. Datasets

We evaluate our proposed model on the Amazon QA dataset (Wan and McAuley, 2016), and utilize the three product categories with the largest number of question-answer pairs for evaluation, namely Electronics, Home and Kitchen, and Sports and Outdoors. The dataset contains questions accompanied by their multiple user-written answers. The product ID of each question is then used to align with the Amazon review dataset (He and McAuley, 2016; Ni et al., 2019) to obtain the corresponding reviews for the question. Since an entire raw review can be lengthy and talk about multiple aspects of the concerned product, each review text is chunked into snippets at the sentence level. Then for each question, we adopt BM25 to rank all the review snippets and collect the top 5 relevant snippets in our experiments.
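The snippet retrieval step can be sketched with a small BM25 scorer. This is a generic illustration rather than the retrieval code used in the experiments: whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, the idf variant, and the function name `bm25_top_snippets` are all assumptions.

```python
import math
from collections import Counter


def bm25_top_snippets(question, snippets, top_n=5, k1=1.5, b=0.75):
    """Rank review snippets against a question with BM25 and return the best ones."""
    if not snippets:
        return []
    tokenized = [s.lower().split() for s in snippets]
    avg_len = sum(len(doc) for doc in tokenized) / len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    n_docs = len(tokenized)
    scored = []
    for idx, doc in enumerate(tokenized):
        term_freq = Counter(doc)
        score = 0.0
        for term in question.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scored.append((score, idx))
    # Highest score first; ties keep the original snippet order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [snippets[idx] for _, idx in scored[:top_n]]


snippets = [
    "works great with my android phone",
    "the box arrived late",
    "android compatible",
]
top = bm25_top_snippets("android phone", snippets, top_n=2)
```

In the toy example the snippet matching both query terms ranks first and the snippet matching only one term ranks second, while the unrelated snippet is dropped.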

Similar to previous works (Halder et al., 2019; Zhang et al., 2020), we treat the user votes from the community as a proxy of the gold label for the quality of an answer. Thus, the answer whose number of positive votes is greater than the number of negative votes is treated as a high-quality (positive) answer, otherwise it is treated as a negative one. We split 10% of the dataset for each product category for testing and the rest is used for training and validation. The statistics are summarized in Table 2, including the number of products (# Product), questions (# Q), answers (# A), and positive answers (# Pos A).

Category      Split       # Product   # Q      # A      # Pos A
Electronics   Train+Val   11,172      15,547   80,115   28,919
              Test        1,657       1,727    8,823    3,184
Home          Train+Val   8,590       12,731   66,956   24,838
              Test        1,349       1,414    7,461    2,801
Sports        Train+Val   4,949       6,952    35,858   13,230
              Test        746         772      4,065    1,511
Table 2. Statistics of data splits of three product categories.
                                      | Electronics                | Home & Kitchen             | Sports & Outdoors
Model                                 | MAP    MRR    P@1    P@3   | MAP    MRR    P@1    P@3   | MAP    MRR    P@1    P@3
BM25 (Robertson and Zaragoza, 2009)   | 0.571  0.576  0.380  0.383 | 0.585  0.592  0.406  0.397 | 0.586  0.589  0.384  0.398
CNN (Severyn and Moschitti, 2015)     | 0.633  0.668  0.460  0.411 | 0.638  0.675  0.474  0.407 | 0.628  0.664  0.452  0.407
aNMM (Yang et al., 2016)              | 0.619  0.651  0.445  0.386 | 0.633  0.670  0.465  0.413 | 0.624  0.659  0.448  0.403
Att-BiLSTM (Tan et al., 2016)         | 0.642  0.671  0.464  0.408 | 0.639  0.673  0.471  0.416 | 0.633  0.665  0.464  0.408
BiMPM (Wang et al., 2017)             | 0.647  0.678  0.480  0.405 | 0.656  0.688  0.491  0.425 | 0.636  0.680  0.482  0.409
HCAN (Rao et al., 2019)               | 0.643  0.676  0.472  0.412 | 0.659  0.686  0.492  0.429 | 0.632  0.666  0.459  0.404
PRHNet (Fan et al., 2019)             | 0.646  0.677  0.478  0.406 | 0.649  0.683  0.483  0.421 | 0.634  0.669  0.469  0.405
PHP (Halder et al., 2019)             | 0.652  0.679  0.475  0.414 | 0.648  0.681  0.484  0.421 | 0.638  0.667  0.463  0.409
MUSE-Pointwise                        | 0.663  0.693  0.504  0.425 | 0.679  0.710  0.521  0.450 | 0.649  0.684  0.482  0.431
MUSE-Listwise                         | 0.678  0.715  0.539  0.417 | 0.675  0.712  0.527  0.443 | 0.657  0.688  0.491  0.435
MUSE-Joint-Loss                       | 0.695  0.711  0.511  0.450 | 0.693  0.714  0.518  0.466 | 0.661  0.694  0.498  0.437
Table 3. Answer ranking results of MUSE and baseline models. The improvements of the MUSE-Joint-Loss model over the strong baseline PHP were assessed with a statistical significance test.

4.2. Baselines and Evaluation Metrics

We compare our proposed MUSE model with several traditional and state-of-the-art methods. To conduct a more comprehensive comparison, we slightly modify some models so that they can take relevant reviews as one of their inputs.

  • BM25 (Robertson and Zaragoza, 2009): It is a widely-used retrieval model for ranking candidate answers given a question.

  • CNN (Severyn and Moschitti, 2015): It employs a CNN-based Siamese network to encode QA pairs for ranking the answers.

  • Attentive-BiLSTM (Tan et al., 2016): It utilizes a bidirectional LSTM as well as an attention mechanism to measure the relevance between the question and answer text.

  • aNMM (Yang et al., 2016): It is an attention based Neural Matching Model, which employs a value-shared weights scheme and a gated attention network to improve the ranking performance.

  • BiMPM (Wang et al., 2017): Bilateral Multi-Perspective Matching is one of the state-of-the-art models in many retrieval-based QA tasks. It matches a QA sentence pair from multiple perspectives.

  • HCAN (Rao et al., 2019): Hybrid Co-Attention Network is a recent model for modeling short text relations. It combines semantic matching and relevance matching components that complement each other for better performance.

  • PRHNet (Fan et al., 2019): It is one of the state-of-the-art models for predicting the quality of product reviews. We concatenate the QA pair, treat it as a single review, and utilize the relevant reviews as the "product information" in the original model.

  • PHP (Halder et al., 2019): Post Helpfulness Prediction is a recent model for predicting the quality of a replying post to an initial post in online discussion forums. We treat the question and answer as the original and replying post respectively, and treat reviews as the previous posts used in the model.

For our proposed MUSE model, we use different suffixes to denote its variants trained with different learning approaches, where "-Pointwise", "-Listwise" and "-Joint-Loss" refer to training with the pointwise, listwise, and joint loss function, respectively.

For evaluation metrics, Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and Precision@N (P@N) are adopted to measure the model performance on ranking user answers. We take N = 1 and N = 3 in the experiments for the P@N metric, which correspond to two common use cases in E-commerce: P@1 measures the performance when only a single answer is paired with each question on the product main page, and P@3 measures how well the top three answers are ranked for each question on the detailed question page containing all user-provided answers.
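For reference, the three metrics can be computed per question from the binary labels of the answers listed in predicted rank order, then averaged over questions. The sketch below uses the standard definitions; how questions with fewer than N answers or with no positive answer are handled in the paper is not stated, so the choices here (divide by N, score 0) are assumptions.

```python
def average_precision(labels):
    """labels: 0/1 relevance of the answers, listed in predicted rank order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(labels):
    """1 / rank of the first positive answer, or 0 if there is none."""
    for rank, rel in enumerate(labels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def precision_at_n(labels, n):
    """Fraction of positive answers among the top n positions."""
    return sum(labels[:n]) / n


def evaluate(ranked_label_lists, ns=(1, 3)):
    """Average each metric over all questions."""
    q = len(ranked_label_lists)
    report = {
        "MAP": sum(average_precision(l) for l in ranked_label_lists) / q,
        "MRR": sum(reciprocal_rank(l) for l in ranked_label_lists) / q,
    }
    for n in ns:
        report[f"P@{n}"] = sum(precision_at_n(l, n) for l in ranked_label_lists) / q
    return report
```

For example, a question whose ranked answers carry the labels `[0, 1, 1, 0]` has an average precision of (1/2 + 2/3) / 2, a reciprocal rank of 1/2, P@1 of 0, and P@3 of 2/3.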

4.3. Experiment Configurations

For the network configurations, we tuned hyper-parameters with the validation set. Specifically, the hidden dimension of the Bi-LSTM context encoder is set to 100, the hidden size of the weight matrix in Equation (7) is 200, is set to 8 for alleviating the noise in reviews. In the semantic relation modeling, we stack two GCN layers to obtain the interaction features for the answers, i.e. , where the hidden dimension of each layer is set to 150 and 100 respectively. The hyper-parameter used to balance between two loss functions in Equation (22) is set to 2 and is set to 0.001.

For the training process, we initialize the word embedding layers of all neural models with the pre-trained 300D GloVe word embeddings. We adopt the Adam optimizer (Kingma and Ba, 2014) to train all learnable parameters, and the batch size is set to 50. All the network weights are initialized randomly from the Xavier uniform distribution (Glorot and Bengio, 2010).
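As an illustration of the configuration, the stated settings can be collected in a dictionary, and Xavier uniform initialization can be written out explicitly. The key names are hypothetical, only values that are named in the text are included, and the helper below is a plain-Python sketch of the Glorot and Bengio (2010) scheme rather than the authors' code.

```python
import math
import random

# Hypothetical key names; the values are those reported in the text.
CONFIG = {
    "bilstm_hidden_dim": 100,
    "attention_hidden_size": 200,   # weight matrix in Equation (7)
    "gcn_hidden_dims": (150, 100),  # two stacked GCN layers
    "loss_balance": 2,              # balances the two losses in Equation (22)
    "embedding_dim": 300,           # pre-trained GloVe
    "batch_size": 50,
    "optimizer": "adam",
}


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)] for _ in range(fan_in)]


# Example: weights mapping the first GCN layer (150) to the second (100).
rng = random.Random(0)
gcn_weight = xavier_uniform(*CONFIG["gcn_hidden_dims"], rng)
```

The bound shrinks as the layer gets wider, which keeps the variance of activations roughly constant across layers at the start of training.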

5. Results and Analysis

Figure 2. Effect of different semantic relations in terms of P@1 and P@3 scores among three product categories. Panel (a) shows P@1 scores for the three datasets; panel (b) shows P@3 scores for the three datasets.

5.1. Answer Ranking Performance

The answer ranking results across the three product categories in terms of MAP, MRR, P@1, and P@3 scores are summarized in Table 3. They show that MUSE outperforms all baseline models on each dataset. We also conduct a statistical significance test comparing MUSE with PHP. The results indicate that MUSE achieves statistically significantly better performance than PHP.

There are several notable observations from the results: (1) Compared to the basic BM25 model, deep learning models generally provide strong baselines for the concerned ranking task. In particular, the BiMPM and HCAN models outperform other answer selection models by taking into account deeper and broader semantic relevance information. (2) Models that consider relevant review information, e.g. the PHP model and our proposed MUSE model, generally achieve better performance. Such results demonstrate that merely considering the semantic relevance between the question and answer text is not sufficient for ranking the user-provided answers in E-commerce settings. (3) Our proposed MUSE model consistently and substantially outperforms all the baselines across the three categories. This result shows that carefully modeling the rich semantic relations among the available information sources, i.e. the question, multiple answers, and relevant reviews, is necessary for effectively ranking the answers. Importantly, MUSE utilizes the multi-semantic relation graph to model the coherence between each specific answer and the common opinions reflected in the entire answer set and reviews, which leads to its superior performance when ranking user answers in the E-commerce scenario.

Comparing the performance of the different variants of the MUSE model, it can be observed that training with the joint loss function (i.e. MUSE-Joint-Loss) generally achieves better performance than learning with the pointwise approach (i.e. MUSE-Pointwise) or the listwise approach (i.e. MUSE-Listwise). The MUSE-Joint-Loss model combines the advantages of the two learning approaches and considers an answer both from the perspective of its own label and from the perspective of the labels of the entire answer list, and thus achieves better results in the majority of cases. Notably, MUSE-Listwise largely outperforms existing models regarding the MRR and P@1 scores. For example, it obtains about 6% absolute improvement in P@1 on the Electronics category over the best baseline (0.539 versus 0.480 for BiMPM). The reason is that by normalizing the prediction list of all answers and minimizing its difference from the label list, we push the positive answers to have relatively higher scores. Thus the model can find one positive answer more easily and rank it as the top answer, which leads to better MRR and P@1 metrics.
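To make the three learning approaches concrete, the sketch below writes a pointwise loss as binary cross-entropy on each answer, a listwise loss as the KL divergence between the normalized label list and the softmax-normalized prediction list, and a joint loss as their weighted sum. These forms are assumptions chosen to match the description above; the exact losses are defined by the paper's own equations, which are outside this section, and which term the balance weight multiplies is also an assumption.

```python
import math


def pointwise_loss(scores, labels):
    """Mean binary cross-entropy: each answer is judged on its own label."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(scores)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def listwise_loss(scores, labels):
    """KL divergence from the normalized label list to the normalized prediction list."""
    mass = sum(labels)
    if mass == 0:
        return 0.0  # no positive answer: the target distribution is undefined
    pred = softmax(scores)
    target = [y / mass for y in labels]
    return sum(t * math.log(t / max(p, 1e-12)) for t, p in zip(target, pred) if t > 0)


def joint_loss(scores, labels, balance=2.0):
    """Weighted sum of both views; weighting the listwise term is an assumption."""
    return pointwise_loss(scores, labels) + balance * listwise_loss(scores, labels)
```

Because the listwise term only depends on the relative scores within one answer list, it directly rewards placing a positive answer above the negatives, which is in line with the MRR and P@1 gains discussed above.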

5.2. Ablation Study

Model Variants               MAP    MRR    P@1    P@3
MUSE-Joint-Loss              0.695  0.711  0.511  0.450
- w/o textual feature        0.649  0.681  0.478  0.409
- w/o interactive feature    0.646  0.673  0.468  0.411
- w/o Q-to-A attention       0.656  0.688  0.490  0.414
- w/o Q-to-C attention       0.663  0.692  0.497  0.419
Table 4. Ablation study on the Electronics category

5.2.1. Impact of Main Components of MUSE

We perform ablation studies by leaving out some important components of the proposed MUSE model to investigate their effectiveness. The results on the largest dataset, Electronics, are presented in Table 4. We first create two variant models by discarding, respectively, the textual feature (denoted as "w/o textual feature") and the interaction feature (denoted as "w/o interactive feature" in Table 4) of a specific answer when obtaining the prediction score in Equation (18). From the results, we can find that both of them play an important role in contributing useful answer representations for the ranking. Specifically, the model without the interaction feature suffers a slightly larger performance decrease. This is likely because the textual feature is used as the initialization for computing the interaction feature in the multiple semantic relation modeling, so some textual information can still be preserved in the ranking process when the textual feature itself is discarded.

In addition, to verify the usefulness of the question attention operation during the textual feature modeling of the answers and reviews, we create two variant models that directly apply a max-pooling operation to the answer and review encodings in Equation (2) to obtain their representations, instead of employing the attention mechanism (denoted as "w/o Q-to-A attention" and "w/o Q-to-C attention" in Table 4, respectively). It can be observed that both lead to a performance decrease, indicating that using the question to attend to the important information during the encoding phase of the answer and review sentences is useful for capturing relevant information for the subsequent learning process.

In particular, we can notice that leaving out the question attention over the answers results in a larger performance decrease. This result shows that the core semantic information in answers is still essential for modeling the textual representations for answer ranking.
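The difference between the two pooling strategies compared in this ablation can be sketched as follows. The dot-product scoring used here is an assumption for illustration; the attention form actually used by MUSE is given by its own equations, which are not reproduced in this section.

```python
import math


def max_pool(token_vectors):
    """Question-agnostic pooling: element-wise maximum over all token vectors."""
    return [max(column) for column in zip(*token_vectors)]


def question_attention_pool(token_vectors, question_vector):
    """Question-aware pooling: weight each token by its softmax-normalized
    dot product with the question vector, then take the weighted sum."""
    scores = [sum(t * q for t, q in zip(tok, question_vector)) for tok in token_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(question_vector)
    return [sum(w * tok[d] for w, tok in zip(weights, token_vectors)) for d in range(dim)]
```

With max-pooling, every token can contribute regardless of the question, whereas the attention variant concentrates the representation on tokens that match the question, which is the behavior the "w/o Q-to-A attention" and "w/o Q-to-C attention" variants remove.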

5.2.2. Impact of Different Semantic Relations.

To better rank the multiple answers for a given question, we model the multiple semantic relations among the question, answers, and relevant reviews in this paper. Thus, in this section we examine the effect of each semantic relation during the graph construction phase by removing one relation at a time. We present P@1 and P@3 scores among three product categories in Figure 2, where "w/o relevance", "w/o similarity" and "w/o entailment" denote the MUSE-Joint-Loss model without the semantic relevance, textual similarity, and textual entailment relation, respectively, when constructing the multi-semantic graph in Section 3.3.2. Also, "all relation" refers to the performance of the model with all three relations. We can see that each of the semantic relations contributes to the final ranking performance and discarding any of them leads to performance degradation. This result illustrates the importance of explicitly modeling the complex relations among the multiple information sources in the E-commerce scenario. In addition, it can be noticed that the entailment relation is more important than the other two relations, which validates the necessity of utilizing relevant reviews as external sources for modeling the opinion coherence between a concerned answer and the common opinion. Moreover, discarding the semantic relevance relation leads to the smallest performance decrease, since the reviews obtained by the initial retrieval process are already related to the question to some extent.
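The relation-removal setup can be pictured as building a typed edge list for each question and dropping one edge type at a time. The sketch below is only a schematic: the node names, the `drop` argument, and the exact edge layout are assumptions (in particular, whether the relevance relation also links the question to reviews is not stated in this section, so only question-answer edges are created here).

```python
from itertools import combinations


def build_relation_graph(num_answers, num_reviews, drop=None):
    """Return {relation: [(node, node), ...]} for one question.

    drop: optional relation name to remove, mimicking the
    "w/o relevance" / "w/o similarity" / "w/o entailment" variants.
    """
    question = "q"
    answers = [f"a{i}" for i in range(num_answers)]
    reviews = [f"r{j}" for j in range(num_reviews)]
    edges = {
        # semantic relevance between the question and each answer
        "relevance": [(question, a) for a in answers],
        # textual similarity among answers and among review snippets
        "similarity": list(combinations(answers, 2)) + list(combinations(reviews, 2)),
        # textual entailment between each review snippet and each answer
        "entailment": [(r, a) for r in reviews for a in answers],
    }
    if drop is not None:
        edges.pop(drop)
    return edges
```

For a question with 4 answers and 5 review snippets, this layout yields 4 relevance edges, 16 similarity edges, and 20 entailment edges, so removing the entailment relation also removes the only path between reviews and answers.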

5.3. Analysis of MUSE model

Figure 3. Analysis of MUSE model. Panel (a) shows the performance with respect to the number of reviews utilized by MUSE on three datasets. Panel (b) presents the MAP score on the test set at each epoch of MUSE with different learning approaches.

5.3.1. Number of Reviews

The relevant reviews are utilized as important external information in our concerned task. The proposed MUSE model aggregates common opinions with the review-review similarity relation and models the opinion coherence of an answer with the review-answer entailment relation. In this section, we vary the number of review snippets used in the model to investigate its effect on the model performance. The MAP and MRR scores with different numbers of reviews on the three datasets are presented in Figure 3(a).

We can see that, as expected, the performance of the model initially improves as more review information is utilized. However, both MAP and MRR scores remain generally unchanged (e.g. on the Electronics and Home datasets) or even slightly decrease (e.g. on the Sports dataset) when we further increase the number of reviews. On the one hand, more reviews can provide more comprehensive common opinions from the community to help rank the candidate answers, leading to the initial performance improvement. On the other hand, increasing the number of reviews used in the model also introduces more parameters and leads to overfitting on smaller datasets such as Sports.

5.3.2. Different Learning Approaches.

We compute the MAP scores at each epoch on the test set for the proposed MUSE model trained with different loss functions. The results on the largest dataset, Electronics, are shown in Figure 3(b). It can be observed that MUSE trained with the listwise loss function converges in fewer epochs than the other two approaches and is less affected by overfitting, which is consistent with observations from previous studies (Bian et al., 2017; Liu, 2009). In contrast, the model trained with the pointwise learning approach is more likely to overfit after a few epochs. Such a phenomenon is likely due to the imbalanced proportion of high-quality and low-quality answers. Thus the model, which is trained to only recognize the label of each answer individually, will tend to predict an answer as a negative one so as to minimize the overall cross-entropy loss. The model trained with the joint loss function is robust to overfitting but converges at a relatively slow pace compared with the listwise learning approach.

5.4. Case Study

Question:   Is it automatic shut off?
Relevant Review Snippets:
- A buzzer sounds to let you know the eggs are done
- When the alarm sounds you need to turn it off and open it…
- I would have liked the cooker to turn off automatically but instead a bell rings until you turn if off.
- Also, by the time the timer goes off, the hot pan has a burning smell.
- and it turns off itself after the bell rings.
Answers (ordered by community votes), with predicted ranks (HCAN / PHP / MUSE):
- No, it beeps until you turn it off. (R3 / R1 / R1)
- No it’s not but it beeps very loud. (R4 / R4 / R2)
- Yes, and it works very well…….we just had poached eggs yesterday. recommend this product :) (R1 / R3 / R3)
- Yes there’s an automatic shut-off when the cooking cycle is finished. (R2 / R2 / R4)
Table 5. A sample case of multiple answers ranked by their original community votes, as well as their predicted ranks by MUSE and two baseline models.

To gain some insights into the proposed MUSE model, we present a sample case in Table 5, which includes a question about an egg cooker product, its multiple user-provided answers, and the relevant review snippets. The answers are ranked by their original community votes. We also present the ranks given to each answer by MUSE and two strong baseline models, namely HCAN and PHP, where "Rx" denotes that the answer is ranked at the x-th position by the corresponding model. From the results, we can see that HCAN performs poorly on this case since it only considers the semantic relevance between the answer and the question text, while all the associated answers are quite topically relevant to the given question. Besides, the PHP model, which incorporates review information into the modeling, also fails to rank all answers correctly, indicating that the semantic relations need to be appropriately exploited. The proposed MUSE model utilizes the answer-answer similarity and review-review similarity relations to capture the common opinion that "the concerned egg cooker cannot turn off automatically". The answer-review entailment relation can then help examine the opinion coherence between each specific answer and the common opinion and hence help the ranking process. Therefore, it successfully ranks all answers in this case. This real-world example indicates the importance of taking review information into consideration. More importantly, carefully modeling the complex semantic relations between the question, answers, and reviews is essential for tackling this task in E-commerce settings.

6. Conclusions

We investigate the answer ranking problem for product-related questions in this paper. To tackle the ranking task in E-commerce settings, we propose a framework named MUSE to jointly model the multiple semantic relations among the question, answers, and relevant reviews. MUSE employs a novel graph convolutional operation customized to integrate the coherence information under different semantic relations to facilitate the ranking task. Extensive experiments on real-world E-commerce datasets show that our proposed model achieves superior performance compared with some strong baseline models.


  • W. Bian, S. Li, Z. Yang, G. Chen, and Z. Lin (2017) A compare-aggregate model with dynamic-clip attention for answer selection. In CIKM, pp. 1987–1990. Cited by: §3.2.3, §5.3.2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In EMNLP, pp. 632–642. Cited by: §1.
  • D. Carmel, L. Lewin-Eytan, and Y. Maarek (2018) Product question answering using customer generated content - research challenges. In SIGIR, pp. 1349–1350. Cited by: §1.
  • C. Chen, M. Qiu, Y. Yang, J. Zhou, J. Huang, X. Li, and F. S. Bao (2019a) Multi-domain gated CNN for review helpfulness prediction. In WWW, pp. 2630–2636. Cited by: §2.
  • L. Chen, Z. Guan, W. Zhao, W. Zhao, X. Wang, Z. Zhao, and H. Sun (2019b) Answer identification from product reviews for user questions by multi-task attentive networks. In AAAI, pp. 45–52. Cited by: §1, §2.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced LSTM for natural language inference. In ACL, pp. 1657–1668. Cited by: §1.
  • S. Chen, C. Li, F. Ji, W. Zhou, and H. Chen (2019c) Review-driven answer generation for product-related questions in e-commerce. In WSDM, pp. 411–419. Cited by: §1, §2.
  • W. Chen, Y. Gu, Z. Ren, X. He, H. Xie, T. Guo, D. Yin, and Y. Zhang (2019d) Semi-supervised user profiling with heterogeneous graph attention networks. In IJCAI, pp. 2116–2122. Cited by: §2.
  • Y. Deng, W. Lam, Y. Xie, D. Chen, Y. Li, M. Yang, and Y. Shen (2019) Joint learning of answer selection and answer summary generation in community question answering. CoRR abs/1911.09801. Cited by: §1.
  • Y. Deng, Y. Shen, M. Yang, Y. Li, N. Du, W. Fan, and K. Lei (2018) Knowledge as A bridge: improving cross-domain answer selection with external knowledge. In COLING, pp. 3295–3305. Cited by: §2.
  • M. Eric, L. Krishnan, F. Charette, and C. D. Manning (2017) Key-value retrieval networks for task-oriented dialogue. In SIGdial, pp. 37–49. Cited by: §1.
  • M. Fan, C. Feng, L. Guo, M. Sun, and P. Li (2019) Product-aware helpfulness prediction of online reviews. In WWW, pp. 2715–2721. Cited by: §2, 7th item, Table 3.
  • S. Gao, Z. Ren, Y. Zhao, D. Zhao, D. Yin, and R. Yan (2019) Product-aware answer generation in e-commerce question-answering. In WSDM, pp. 429–437. Cited by: §1, §2.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pp. 249–256. Cited by: §4.3.
  • K. Halder, M. Kan, and K. Sugiyama (2019) Predicting helpful posts in open-ended discussion forums: A neural architecture. In NAACL-HLT, pp. 3148–3157. Cited by: §2, 8th item, §4.1, Table 3.
  • S. M. Harabagiu and A. Hickl (2006) Methods for using textual entailment in open-domain question answering. In ACL, Cited by: §3.3.1.
  • S. Harel, S. Albo, E. Agichtein, and K. Radinsky (2019) Learning novelty-aware ranking of answers to complex questions. In WWW, pp. 2799–2805. Cited by: §2.
  • R. He and J. J. McAuley (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pp. 507–517. Cited by: §4.1.
  • J. Jeon, W. B. Croft, J. H. Lee, and S. Park (2006) A framework to predict the quality of answers with non-textual features. In SIGIR, pp. 228–235. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §2, §3.3.1.
  • A. Li, Z. Qin, R. Liu, Y. Yang, and D. Li (2019) Spam review detection with graph convolutional networks. In CIKM, pp. 2703–2711. Cited by: §2.
  • T. Liu (2009) Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3 (3), pp. 225–331. Cited by: §5.3.2.
  • S. Lyu, W. Ouyang, Y. Wang, H. Shen, and X. Cheng (2019) What we vote for? answer selection from user expertise view in community question answering. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pp. 1198–1209. Cited by: §2.
  • J. McAuley and A. Yang (2016) Addressing complex and subjective product-related queries with customer reviews. In WWW, pp. 625–635. Cited by: §2.
  • P. Nakov, D. Hoogeveen, L. Màrquez, A. Moschitti, H. Mubarak, T. Baldwin, and K. Verspoor (2017) SemEval-2017 task 3: community question answering. In SemEval@ACL, pp. 27–48. Cited by: §1, §2.
  • J. Ni, J. Li, and J. J. McAuley (2019) Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In EMNLP-IJCNLP, pp. 188–197. Cited by: §4.1.
  • A. Omari, D. Carmel, O. Rokhlenko, and I. Szpektor (2016) Novelty based ranking of human answers for community questions. In SIGIR, pp. 215–224. Cited by: §2.
  • X. Qiu and X. Huang (2015) Convolutional neural tensor network architecture for community-based question answering. In IJCAI, pp. 1305–1311. Cited by: §1.
  • J. Rao, H. He, and J. Lin (2017) Experiments with convolutional neural network models for answer selection. In SIGIR, pp. 1217–1220. Cited by: §2.
  • J. Rao, L. Liu, Y. Tay, W. Yang, P. Shi, and J. Lin (2019) Bridging the gap between relevance matching and semantic matching for short text similarity modeling. In EMNLP-IJCNLP, pp. 5373–5384. Cited by: §1, §3.3.1, 6th item, Table 3.
  • S. E. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3 (4), pp. 333–389. Cited by: 1st item, Table 3.
  • M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, pp. 593–607. Cited by: §2, §3.3.3.
  • A. Severyn and A. Moschitti (2013) Automatic feature engineering for answer selection and extraction. In EMNLP, pp. 458–467. Cited by: §2.
  • A. Severyn and A. Moschitti (2015) Learning to rank short text pairs with convolutional deep neural networks. In SIGIR, pp. 373–382. Cited by: §2, 2nd item, Table 3.
  • C. Shah and J. Pomerantz (2010) Evaluating and predicting answer quality in community QA. In SIGIR, pp. 411–418. Cited by: §2.
  • T. Shao, F. Cai, H. Chen, and M. de Rijke (2019) Length-adaptive neural network for answer selection. In SIGIR, pp. 869–872. Cited by: §2.
  • Z. Sun, J. Tang, P. Du, Z. Deng, and J. Nie (2019) DivGraphPointer: A graph pointer network for extracting diverse keyphrases. In SIGIR, pp. 755–764. Cited by: §2.
  • M. A. Suryanto, E. Lim, A. Sun, and R. H. L. Chiang (2009) Quality-aware collaborative question answering: methods and evaluation. In WSDM, pp. 142–151. Cited by: §1.
  • M. Tan, C. N. dos Santos, B. Xiang, and B. Zhou (2016) Improved representation learning for question answer matching. In ACL, Cited by: §1, §2, 3rd item, Table 3.
  • Y. Tay, M. C. Phan, A. T. Luu, and S. C. Hui (2017) Learning to rank question answer pairs with holographic dual LSTM architecture. In SIGIR, pp. 695–704. Cited by: §1.
  • H. Trivedi, H. Kwon, T. Khot, A. Sabharwal, and N. Balasubramanian (2019) Repurposing entailment for multi-hop question answering tasks. In NAACL-HLT, pp. 2948–2958. Cited by: §3.3.1.
  • K. Tymoshenko and A. Moschitti (2018) Cross-pair text representations for answer sentence selection. In NAACL, pp. 2162–2173. Cited by: §1, §3.3.1.
  • M. Wan and J. J. McAuley (2016) Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In ICDM, pp. 489–498. Cited by: §2, §4.1.
  • M. Wang, N. A. Smith, and T. Mitamura (2007) What is the jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL, pp. 22–32. Cited by: §1, §2.
  • S. Wang and J. Jiang (2017) A compare-aggregate model for matching text sequences. In ICLR, Cited by: §2.
  • Z. Wang, W. Hamza, and R. Florian (2017) Bilateral multi-perspective matching for natural language sentences. In IJCAI, pp. 4144–4150. Cited by: §2, 5th item, Table 3.
  • L. Yang, Q. Ai, J. Guo, and W. B. Croft (2016) ANMM: ranking short answer texts with attention-based neural matching model. In CIKM, pp. 287–296. Cited by: 4th item, Table 3.
  • R. Yang, J. Zhang, X. Gao, F. Ji, and H. Chen (2019) Simple and effective text matching with richer alignment features. In ACL, pp. 4699–4709. Cited by: §2.
  • Y. Yang, W. Yih, and C. Meek (2015) WikiQA: A challenge dataset for open-domain question answering. In EMNLP, pp. 2013–2018. Cited by: §1.
  • J. Yu, M. Qiu, J. Jiang, J. Huang, S. Song, W. Chu, and H. Chen (2018) Modelling domain relationships for transfer learning on retrieval-based question answering systems in e-commerce. In WSDM, pp. 682–690. Cited by: §1.
  • J. Yu, Z. Zha, and T. Chua (2012) Answering opinion questions on products by exploiting hierarchical organization of consumer reviews. In EMNLP-CoNLL, pp. 391–401. Cited by: §1, §2.
  • Q. Yu and W. Lam (2018) Review-aware answer prediction for product-related questions incorporating aspects. In WSDM, pp. 691–699. Cited by: §1, §2.
  • W. Zhang, W. Lam, Y. Deng, and J. Ma (2020) Review-guided helpful answer identification in e-commerce. In WWW, pp. 2620–2626. Cited by: §1, §2, §4.1.