1. Introduction
Individuals often visit Community Question Answer (CQA) forums, like StackExchange, to seek answers to nuanced questions; typically, these answers are not readily available on web search engines.
A fundamental challenge for a CQA search engine in response to an individual’s question (i.e., information need) is to rank and identify similar past questions and relevant answers to those questions. On some CQA sites like StackExchange, individuals who post questions may label an answer as ‘accepted,’ but other questions with answers (about % in our analysis) have none labeled as ‘accepted.’ On other CQA sites like Reddit, there is no mechanism for a person to label an answer as ‘accepted.’ As a first step to address the individual’s information needs, in this paper, we focus on the problem of identifying accepted answers on StackExchange.
One approach to identifying relevant answers is to extract salient features for each question-answer tuple and treat the task as a supervised classification problem (BurelMA16; JendersKN16; TianZL13; TianL16). Deep text models develop this approach further (ZhangLSW17; WuWS18; WangN15; SukhbaatarSWF15). These models learn the optimal text representation of each tuple to select the most relevant answer. While the deep text models are sophisticated, text-based models are computationally expensive to train. Furthermore, there are limitations to examining tuples in isolation. First, an answer is "relevant" in relationship to other answers to the same question. Second, examining tuples in isolation ignores the fact that the same user may answer multiple questions in the forum. An alternative approach then is to examine the graph structure resulting from users answering multiple questions, in addition to the answer features. Graph Convolutional Networks (GCNs) are a popular technique for incorporating graph structure and are used in tasks including node classification (gcn) and link prediction (relationalGCN). Extensions to the basic GCN model include signed networks (signedgcn), inductive settings (graphsage) and multiple relations (DualGCN; relationalGCN).
While GCNs are a plausible approach, we need to overcome a fundamental implicit assumption in prior work before we can apply them to our problem. Prior work on GCNs adopts label sharing amongst nodes; label sharing implicitly assumes similarity between two nodes connected by an edge. In the Answer Selection problem, however, answers to the same question connected by an edge may not share the same acceptance label. In particular, we may label an answer as ‘accepted’ based on how it differs from other answers to the same question. Signed GCNs (signedgcn) cannot capture this contrast despite their ability to incorporate signed edges. Graph attention networks (graphattention) also cannot learn negative attention weights over neighbors, as the weights are the output of a softmax operation.
We develop a novel induced relational framework to address our problem. The key idea is to use diverse strategies to identify the accepted answer: the label depends only on the answer (reflexive); the label is determined in contrast with the other answers to the question (contrastive); and labels are shared among answers across questions if they contrast with other answers similarly (similarity by contrast). Each strategy induces a graph between tuples and then uses a particular label selection mechanism to identify the accepted answer. Our strategies generalize to a broader principle: pick an equivalence relation to induce a graph comprising cliques, and then pick a label selection mechanism (label sharing or label contrast) within each clique. We show how to develop a GCN architecture to operationalize the specific label selection mechanism (label sharing or label contrast). Then, we aggregate results across strategies through a boosting framework to identify the label for each tuple. Our contributions are as follows:
Modular, Induced Relational Framework:

We introduce a modular framework that separates the construction of the graph from the label selection mechanism. In contrast, prior work in answer selection (e.g., (BurelMA16; JendersKN16; TianZL13; TianL16)) looked at individual tuples, and work on GCNs (e.g., (gcn; DualGCN)) operates on the given graph (i.e., no induced graphs) with similarity as the mechanism for label propagation. We use equivalence relations to induce a graph comprising cliques and identify two label assignment mechanisms: label contrast and label sharing. Then, we show how to encode these assignment mechanisms in GCNs. In particular, we show that the use of equivalence relations allows us to perform exact convolution in GCNs. We call our framework Induced Relational GCN (IRGCN). Our framework allows for parallelization and applies to other problems that need application semantics to induce graphs independent of any existing graphs (InducedGraph).
Discriminative Semantics:

We show how to encode the notion of label contrast between a vertex and a group of vertices in GCNs. Label contrast is critical to the problem of accepted answer selection. Related work in GCNs (e.g., (gcn; DualGCN)) emphasizes node similarity, including the work on signed graphs (signedgcn). In (signedgcn), contrast is a property of an edge, not a group and is not expressive enough for our problem. We show that our encoding of contrast creates discriminative magnification—the separation between nodes in the embedding space is most meaningful at smaller clique sizes; the effect decreases with clique size.
Boosted Architecture:

We show through extensive empirical results that using common boosting techniques improves learning in our convolutional model. This improvement is a surprising result since much of the work on neural architectures develops stacking, fusion, or aggregator architectures.
We conducted extensive experiments using our IRGCN framework, with excellent experimental results on a popular CQA forum, StackExchange. For our analysis, we collect data from 50 communities: the ten largest communities from each of the five StackExchange categories (https://stackexchange.com/sites). We achieved an improvement of over 4% in accuracy and 2.5% in MRR on average over state-of-the-art baselines. We also show that our model is more robust to label sparsity compared to alternate GCN-based multi-relational approaches.
We organize the rest of this paper as follows. In section 2, we formulate our problem statement and then discuss induced relations for the Answer Selection problem in section 3. We then detail the operationalization of these induced relations in the Graph Convolution framework in section 4 and introduce our gradient boosting based aggregator approach in section 5. Section 6 describes experiments. We discuss related work in section 7 and then conclude in section 8.
2. Problem Formulation
In Community Question Answer (CQA) forums, an individual asking a question seeks to identify the most relevant candidate answer to his question. On StackExchange CQA forums, users annotate their preferred answer as “accepted.”
Let $\mathcal{Q}$ denote the set of questions in the community and, for each $q \in \mathcal{Q}$, let $\mathcal{A}_q$ denote the associated set of answers. Each question $q \in \mathcal{Q}$ and each answer $a \in \mathcal{A}_q$ has an author, $u_q$ and $u_a$ respectively. Without loss of generality, assume that we can extract features for each question $q$, each answer $a$, and each user $u$.
Our unit of analysis is a question-answer tuple $(q, a)$ with $a \in \mathcal{A}_q$, and we associate each tuple with a label $y_{q,a} \in \{-1, +1\}$, where ‘+1’ implies acceptance and ‘-1’ implies rejection.
The goal of this paper is to develop a framework to identify the accepted answer to a question posted on a CQA forum.
3. Induced Relational Views
In this section, we discuss the idea of induced relational views, central to our induced relational GCN framework developed in Section 4. First, in Section 3.1, we introduce potential strategies for selecting the accepted answer given a question. We show how each strategy induces a graph on the question-answer tuples. Next, in Section 3.2, we show how each of these example strategies is an instance of an equivalence relation; our framework generalizes to incorporate any such relation.
3.1. Constructing Induced Views
In this section, we discuss in detail four example strategies that can be used by the individual posting the question to label an answer as ‘accepted.’ Each strategy $S_i$ induces a graph $G_i = (V, E_i)$ (also referred to as a relational view). In each graph $G_i$, a vertex $v \in V$ corresponds to a tuple $(q, a)$ and an edge $e \in E_i$ connects two tuples that are matched under that strategy. Note that each $G_i$ has the same vertex set $V$, and the edge sets $E_i$ are strategy dependent. Each strategy employs one of three relation types (reflexive, contrastive, and similar) to connect the tuples. We use one reflexive strategy, one contrastive strategy, and two similar strategies. Figure 1 summarizes the three relations. Below, we organize the discussion by relation type.
3.1.1. Reflexive
A natural strategy is to examine each $(q, a)$ tuple in isolation and then assign a label $y_{q,a} \in \{-1, +1\}$ corresponding to ‘not accepted’ or ‘accepted.’ In this case, $y_{q,a}$ depends only on the features of $(q, a)$. This is a Reflexive relation, and the corresponding graph $G_r = (V, E_r)$ has a specific structure. In particular, in this graph $G_r$, we have only self-loops, and all edges $e \in E_r$ are of the type $(v, v)$. That is, for each vertex $v \in V$, there are no edges $(v, u)$ to any other vertex $u \neq v$. Much of the prior work on feature driven answer selection (BurelMA16; JendersKN16; TianZL13; TianL16) adopts this view.
3.1.2. Contrastive
A second strategy is to examine answers in relation to other answers to the same question and label one such answer as ‘accepted.’ Thus the second strategy contrasts $(q, a)$ with other tuples $(q, a')$, where $a' \neq a$ is another answer to $q$. This is a Contrastive relation and the corresponding graph $G_c = (V, E_c)$ has a specific structure. Specifically, we define an edge $e \in E_c$ between all $(q, a)$ tuples for the same question $q$. That is, if $v = (q_1, a_1)$ and $u = (q_2, a_2)$, then $(v, u) \in E_c \iff q_1 = q_2$. Intuitively, the contrastive relation induces cliques connecting all answers to the same question. Introducing contrasts between vertices sharpens differences between features, an effect (described in more detail in Section 4.2) we term Discriminative Feature Magnification. Notice that the contrastive relation is distinct from graphs with signed edges (e.g., (signedgcn)). In our framework, the contrast is a neighborhood property of a vertex, whereas in (signedgcn), the negative sign is a property of an edge.
3.1.3. Similar Contrasts
A third strategy is to identify similar tuples across questions. Prior work (Wu2016) indicates that individuals on StackExchange use diverse strategies to contribute answers. Experts (with high reputation) tend to answer harder questions, while new members (with low reputation) looking to acquire reputation tend to be the first to answer a question.
How might similarity by contrast work? Consider two individuals Alice and Bob with similar reputations (either high or low) on StackExchange, who contribute answers $a$ and $a'$ to questions $q$ and $q'$ respectively. If Alice and Bob have a high reputation difference with the other individuals who answer questions $q$ and $q'$ respectively, then it is likely that $(q, a)$ and $(q', a')$ will share the same label (if they are both experts, their answers might be accepted; if they are both novices, then this is less likely). However, if Alice has a high reputation difference with other peers who answer $q$, but Bob does not have that difference with peers who answer $q'$, then it is less likely that the tuples $(q, a)$ and $(q', a')$ will share the label, even though the reputations of Alice and Bob are similar.
Thus the key idea of the Similar Contrasts relation is to link tuples that are similar in how they differ from other tuples. We construct the graph $G_s = (V, E_s)$ in the following manner. An edge $(v, u)$ between tuples $v$ and $u$ exists if the similarity $s(v, u)$ between the tuples exceeds a threshold $\delta$. We define the similarity function $s(\cdot, \cdot)$ to encode similarity by contrast. That is, $(v, u) \in E_s \iff s(v, u) > \delta$.
Motivated by (Wu2016), we consider two different views that correspond to the similar contrast relation. The TrueSkill Similarity view connects all answers authored by a user where her skill (computed via Bayesian TrueSkill (TrueSkill06)) differs from competitors by a margin $\delta$. We capture both cases, when the user is less skilled and when she is more skilled than her competitors. In the Arrival Similarity view, we connect answers across questions based on the similarity in the relative time of their arrival (posting timestamp). Notice that the two Similar Contrast views have different edge ($E$) sets since the corresponding similarity functions are different. Notice also that the two similarity function definitions are transitive (see footnote 2).
Footnote 2: One trivial way of establishing similarity is co-authorship, i.e., connecting all $(q, a)$ tuples of a user (probably on the same topic) across different questions. Note that the accepted answer is labeled in relation to the other answers. As the competing answers are different in each question, we cannot trivially assume acceptance label similarity for all co-authored answers. In our experiments, co-authorship introduced a lot of noisy links in the graph, leading to worse performance.
3.2. Generalized Views
Now we present the general case of the induced view. First, notice that each of the three relation types that we consider (reflexive, contrastive, and similar) results in a graph comprising a set of cliques. This is not surprising, since all three relations presented here are equivalence relations. Second, observe the semantics of how we select the tuple with the accepted answer. Within the three relations, we used two semantically different ways to assign the ‘accepted’ answer label to a tuple. One way is to share the labels amongst all the vertices in the same clique (used in the reflexive and the similar relations). The second is to assign the label based on contrasts with other vertices in the same clique. We can now state the organizing principle of our approach as follows.
A generalized modular framework: pick a meaningful equivalence relation on the tuples to induce graph comprising cliques and then apply specific label semantics within each clique.
An equivalence relation results in a graph with a set of disconnected cliques. Then, within a clique, one could use application-specific semantics, different from the two discussed in this paper, to label tuples as ‘accepted.’ Cliques have some advantages: they have well-defined graph spectra (Chung1997, p. 6); they allow for exact graph convolution; and they let us parallelize training, as the convolution of a clique is independent of other cliques.
Thus, each strategy induces a graph using one of the three equivalence relations—reflexive, contrastive and similar—and then applies one of the two semantics (‘share the same label’; ‘determine label based on contrast’).
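The construction of a view from an equivalence relation can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `induce_clique_adjacency` and the toy tuples are hypothetical, and each vertex is assumed to carry a hashable key identifying its equivalence class (for the contrastive view, the question identifier).

```python
from collections import defaultdict

import numpy as np


def induce_clique_adjacency(keys):
    """Adjacency matrix of the graph induced by an equivalence relation.

    keys[i] is a hashable label of the equivalence class of vertex i.
    Two distinct vertices are connected iff their keys are equal, so each
    equivalence class becomes a clique (without self-loops).
    """
    n = len(keys)
    classes = defaultdict(list)
    for idx, key in enumerate(keys):
        classes[key].append(idx)
    adj = np.zeros((n, n), dtype=int)
    for members in classes.values():
        for i in members:
            for j in members:
                if i != j:
                    adj[i, j] = 1
    return adj


# Hypothetical toy data: five (question, answer) tuples over two questions.
tuples = [("q1", "a1"), ("q1", "a2"), ("q1", "a3"), ("q2", "a4"), ("q2", "a5")]

# Contrastive view: tuples are equivalent when they share a question.
contrastive_adj = induce_clique_adjacency([q for q, _ in tuples])

# Reflexive view: every tuple is only related to itself (self-loops).
reflexive_adj = np.eye(len(tuples), dtype=int)
```

Because the cliques are disconnected, each equivalence class could also be stored and processed as its own small block, which is one way to realize the parallelization noted above.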
4. Induced Relational GCN
Now, we will encode the two label assignment mechanisms within a clique via a graph convolution. First, we briefly review Graph Convolution Networks (GCN) and identify some key concepts. Then, given the views for the four strategies, we show how to introduce label contrasts in Section 4.2 followed by label sharing in Section 4.3.
4.1. Graph Convolution
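Graph Convolution models adapt the convolution operations on regular grids (like images) to irregular graph-structured data $G = (V, E)$, learning low-dimensional vertex representations. If, for example, we associate a scalar with each vertex $v \in V$, where $|V| = N$, then we can describe the convolution operation on a graph by the product of a signal $x \in \mathbb{R}^N$ (feature vectors) with a learned filter $g_\theta$ in the Fourier domain. Thus,

$$g_\theta * x = U g_\theta(\Lambda) U^T x \qquad (1)$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ and $U$ are the eigenvalues and eigenvectors of the normalized graph Laplacian, $L = I_N - D^{-1/2} A D^{-1/2}$, and where $D = \mathrm{diag}(\sum_j A_{ij})$. $A$ denotes the adjacency matrix of a graph $G$ (associated with a view) with $N$ vertices. Equation 1 implies a filter $g_\theta$ with $N$ free parameters, and requires an expensive eigenvector decomposition of the adjacency matrix $A$. deferrard proposed to approximate $g_\theta(\Lambda)$, which in general is a function of $\Lambda$, by a sum of Chebyshev polynomials $T_k(\cdot)$ up to the $K$-th order. Then,

$$g_\theta * x \approx U \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) U^T x = \sum_{k=0}^{K} \theta_k T_k(\tilde{L}) x \qquad (2)$$

where $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_N$ are the scaled eigenvalues and $\tilde{L} = 2L/\lambda_{\max} - I_N$ is the corresponding scaled Laplacian. Since $\tilde{L} = U \tilde{\Lambda} U^T$, the two equations are approximately equal.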
The key result from deferrard is that Equation 2 implies $K$-hop localization: the convolution result depends only on the $K$-hop neighborhood. In other words, Equation 2 is a $K$-hop approximation.
However, since we use equivalence relations in our framework that result in cliques, we can do an exact convolution operation since vertices in a clique only have one-hop (i.e., $K = 1$) neighbors (see lemma 5.2, (Hammond2011)). The resulting convolution is linear in the normalized graph Laplacian $L$ and now has only two filter parameters, $\theta_0$ and $\theta_1$, shared over the whole graph.

$$g_\theta * x = \theta_0 x + \theta_1 (L - I_N) x \qquad (3)$$
We emphasize the distinction with gcn, who approximate the deferrard observation by restricting $K = 1$. They do so since they work on arbitrary graphs; since our relations result in views with cliques, we do not make any approximation by using $K = 1$.
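One way to see why a first-order filter loses nothing on a clique is to check numerically that the square of the clique's normalized Laplacian is a scalar multiple of the Laplacian itself, so every higher-order polynomial term collapses to a linear one. The sketch below is an illustrative sanity check rather than part of the framework; the helper name `clique_normalized_laplacian` and the clique size are assumptions chosen for the example.

```python
import numpy as np


def clique_normalized_laplacian(n):
    """Normalized Laplacian I - D^{-1/2} A D^{-1/2} of a clique of size n >= 2."""
    adj = np.ones((n, n)) - np.eye(n)          # complete graph, no self-loops
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
    return np.eye(n) - deg_inv_sqrt @ adj @ deg_inv_sqrt


n = 5  # hypothetical clique size
lap = clique_normalized_laplacian(n)

# L @ L equals (n / (n - 1)) * L, so any polynomial in L is linear in L.
lap_squared_is_linear = np.allclose(lap @ lap, (n / (n - 1)) * lap)

# The spectrum has only two distinct values: 0 and n / (n - 1).
eigenvalues = np.sort(np.linalg.eigvalsh(lap))
```

Here `lap_squared_is_linear` evaluates to `True`, and the eigenvalues are a single 0 followed by $n - 1$ copies of $n/(n-1)$.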
4.2. Contrastive Graph Convolution
Now, we show how to perform graph convolution to encode the mechanism of contrast, where label assignments for a tuple depend on the contrast with its neighborhood.
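To establish contrast, we need to compute the difference between the vertex’s own features and those of its neighborhood in the clique. Thus we transform Equation 3 by setting $\theta = \theta_0 = \theta_1$, which essentially restricts the filters learned by the GCN. This transformation leads to the following convolution operation:

$$g_\theta * x = \theta (I_N + L - I_N) x \qquad (4)$$
$$g_\theta * x = \theta (I_N - D^{-1/2} A D^{-1/2}) x \qquad (5)$$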
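Notice that Equation 5 says that, for example, for any vertex $i$ with a scalar feature value $x_i$, for a given clique with $n \geq 2$ vertices, the convolution operation computes a new value $\hat{x}_i$ for vertex $i$ as follows:

$$\hat{x}_i = \theta \Big( x_i - \frac{1}{n-1} \sum_{j \in \mathcal{N}_i} x_j \Big) \qquad (6)$$

where $\mathcal{N}_i$ is the neighborhood of vertex $i$. Notice that since our equivalence relations construct cliques, for all vertices $i$ that belong to a clique of size $n$, $|\mathcal{N}_i| = n - 1$.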
When we apply the convolution operation in Equation 5 at each layer of the GCN, the output for the $k$-th layer is:

$$Z_c^k = \sigma\big( (I_N - D^{-1/2} A_c D^{-1/2}) Z_c^{k-1} W_c^k \big) \qquad (7)$$

with $A_c$ denoting the adjacency matrix in the contrastive view. $Z_c^k \in \mathbb{R}^{N \times d}$ are the learned vertex representations for each tuple under the contrastive label assignment. $N$ is the total number of tuples and $d$ refers to the dimensionality of the embedding space. $Z_c^{k-1}$ refers to the output of the previous $(k-1)$-th layer, and $Z_c^0 = X$, where $X$ is the input feature matrix. $W_c^k$ are the filter parameters learnt by the GCN; $\sigma(\cdot)$ denotes the activation function (e.g., ReLU).
To understand the effect of Equation 7 on a tuple, let us restrict our attention to a vertex $i$ in a clique of size $n$. We can do this since the convolution result in one clique is unaffected by other cliques. When we do this, we obtain:

$$z_c^k(i) = \sigma\Big( \Big( z_c^{k-1}(i) - \frac{1}{n-1} \sum_{j \in \mathcal{N}_i} z_c^{k-1}(j) \Big) W_c^k \Big) \qquad (8)$$
Now consider a pair of contrasting vertices, $i$ and $j$, in the same clique of size $n$. Let us ignore the linear transform by setting $W_c^k = I$ and set $\sigma(\cdot)$ to the identity function. Then we can easily verify that:

$$z_c^k(i) - z_c^k(j) = \Big( 1 + \frac{1}{n-1} \Big) \big( z_c^{k-1}(i) - z_c^{k-1}(j) \big) \qquad (9)$$

where $z_c^k(i)$ denotes the output of the $k$-th convolution layer for the $i$-th vertex in the contrastive view. As a result, each convolutional layer magnifies the feature contrast between the vertices that belong to the same clique. Thus, the contrasting vertices move further apart. We term this Discriminative Feature Magnification, and Equation 9 implies that we should see a higher magnification effect for smaller cliques.
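The magnification effect can be checked on a toy clique. With identity weights and an identity activation, one contrastive layer replaces each feature by its value minus the mean of its clique neighbors, so the gap between any two vertices in a clique of size $n$ should grow by a factor of $n/(n-1)$. The sketch below assumes this simplified setting; the function name `contrastive_layer` and the feature values are hypothetical.

```python
import numpy as np


def contrastive_layer(features, adj):
    """One contrastive convolution, (I - D^{-1/2} A D^{-1/2}) X,
    with the weight matrix and activation both set to the identity."""
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
    laplacian = np.eye(len(adj)) - deg_inv_sqrt @ adj @ deg_inv_sqrt
    return laplacian @ features


n = 4                                        # clique size (toy value)
adj = np.ones((n, n)) - np.eye(n)            # one clique, no self-loops
x = np.array([[1.0], [2.0], [4.0], [7.0]])   # hypothetical scalar features

out = contrastive_layer(x, adj)

gap_before = x[3, 0] - x[0, 0]
gap_after = out[3, 0] - out[0, 0]
magnification = gap_after / gap_before       # expected: n / (n - 1)
```

For this clique of four vertices the ratio is $4/3$; repeating the layer compounds the factor, and larger cliques yield a factor closer to 1.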
4.3. Encoding Similarity Convolution
We next discuss how to encode the mechanism of sharing labels in a GCN. While label sharing applies to our similar contrast relation (two strategies: Arrival similarity; TrueSkill similarity, see Section 3.1), it is also trivially applicable to the reflexive relation, where the label of the tuple only depends on itself. First, we discuss the case of similar contrasts.
4.3.1. Encoding Similar Contrasts
To encode label sharing for the two similar by contrast cases, we transform Equation 3 with the assumption $\theta = \theta_0 = -\theta_1$. Thus

$$g_\theta * x = \theta (I_N + D^{-1/2} A D^{-1/2}) x \qquad (10)$$
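Similar to the Equation 5 analysis, the convolution operation in Equation 10 computes a new value $\hat{x}_i$ for vertex $i$ as follows:

$$\hat{x}_i = \theta \Big( x_i + \frac{1}{n-1} \sum_{j \in \mathcal{N}_i} x_j \Big) \qquad (11)$$
$$\hat{x}_i = \theta \Big( \frac{n-2}{n-1} x_i + \frac{n}{n-1} \mu \Big) \qquad (12)$$

That is, in the mechanism where we share labels in a clique, the convolution pushes the value of each vertex in the clique toward the average feature value, $\mu = \frac{1}{n} \sum_{j \in \mathcal{N}_i \cup \{i\}} x_j$, in the clique.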
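When we apply the convolution operation in Equation 10 at each layer of the GCN, the output for the $k$-th layer is:

$$Z_s^k = \sigma\big( (I_N + D^{-1/2} A_s D^{-1/2}) Z_s^{k-1} W_s^k \big) \qquad (13)$$

with $A_s$ denoting the adjacency matrix in the similar views.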
We analyze the similarity GCN in a manner akin to Equation 8 and we can easily verify that:

$$z_s^k(i) - z_s^k(j) = \Big( 1 - \frac{1}{n-1} \Big) \big( z_s^{k-1}(i) - z_s^{k-1}(j) \big) \qquad (14)$$

where $z_s^k(i)$ denotes the output of the $k$-th convolution layer for the $i$-th vertex in the similar view. As a result, each convolutional layer reduces the feature contrast between the vertices that belong to the same clique. Thus, the similar vertices move closer.
The proposed label sharing encoding applies to both similar contrast strategies (TrueSkill; Arrival). We refer to the corresponding vertex representations as $Z_{ts}^K$ (TrueSkill) and $Z_{as}^K$ (Arrival).
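The contraction under label sharing can be checked the same way as the magnification under contrast. With identity weights and an identity activation, one label sharing layer adds to each feature the mean of its clique neighbors, so the gap between any two vertices in a clique of size $n$ should shrink by a factor of $(n-2)/(n-1)$. The sketch below assumes this simplified setting; the function name `sharing_layer` and the feature values are hypothetical.

```python
import numpy as np


def sharing_layer(features, adj):
    """One label sharing convolution, (I + D^{-1/2} A D^{-1/2}) X,
    with the weight matrix and activation both set to the identity."""
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
    propagation = np.eye(len(adj)) + deg_inv_sqrt @ adj @ deg_inv_sqrt
    return propagation @ features


n = 4                                        # clique size (toy value)
adj = np.ones((n, n)) - np.eye(n)            # one clique, no self-loops
x = np.array([[1.0], [2.0], [4.0], [7.0]])   # hypothetical scalar features

out = sharing_layer(x, adj)

gap_before = x[3, 0] - x[0, 0]
gap_after = out[3, 0] - out[0, 0]
contraction = gap_after / gap_before         # expected: (n - 2) / (n - 1)
```

For this clique of four vertices the ratio is $2/3$. In a clique of size two the factor is zero, so both vertices receive the same value after a single layer.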
4.3.2. Reflexive Convolution
We encode the reflexive relation with self-loops in the graph, resulting in an identity adjacency matrix. This relation is the trivial label sharing case, with an independent assignment of vertex labels. Thus, the output of the $k$-th convolutional layer for the reflexive view, $Z_r^k$, reduces to:

$$Z_r^k = \sigma\big( I_N Z_r^{k-1} W_r^k \big) \qquad (15)$$
Hence, the reflexive convolution operation is equivalent to a feedforward neural network with multiple layers and activation $\sigma(\cdot)$.
Each strategy $S_i$ belongs to one of the three relation types (reflexive, contrastive and similarity), where $R$ denotes the set of strategies of that relation type. $\mathcal{R}$ denotes the set of all relation types. $Z_i^K \in \mathbb{R}^{N \times d}$ represents the $d$-dimensional vertex embeddings for strategy $S_i$ at the $K$-th layer. For each strategy $S_i \in R$, we obtain a scalar score by multiplying $Z_i^K$ with transform parameters $\tilde{W}_i \in \mathbb{R}^{d \times 1}$. The sum of these scores gives the combined prediction score, $H_R \in \mathbb{R}^{N \times 1}$, for that relation type.

$$H_R = \sum_{S_i \in R} Z_i^K \tilde{W}_i \qquad (16)$$
In this section, we proposed Graph Convolutional architectures to compute vertex representations of each tuple under the four strategies. In particular, we showed how to encode two different label assignment mechanisms (sharing labels, and determining the label based on contrast) within a clique. The architecture that encodes label assignment based on contrast is a novel contribution, distinct from the formulations presented by gcn and its extensions (signedgcn; relationalGCN). Prior convolutional architectures implicitly encode the label sharing mechanism (eq. 10); however, label sharing is unsuitable for contrastive relationships across vertices. Hence our architecture fills this gap in prior work.
5. Aggregating Induced Views
In the previous sections, we introduced four strategies to identify the accepted answer to a question. Each strategy induces a graph or relational view between tuples. Each relational view is expected to capture semantically diverse neighborhoods of vertices. The convolution operator aggregates the neighborhood information under each view. The key question that follows is, how do we combine these diverse views in a unified learning framework? Past work has considered multiple solutions:

Neighborhood Aggregation: In this approach, we represent each vertex by aggregating the feature representations of its neighbors across all views (graphsage; relationalGCN).

Stacking: Multiple convolution layers are stacked end-to-end (each potentially handling a different view) (Stacking).

Fusion: Follows a multimodal fusion approach (Fusion18), where views are considered distinct data modalities.

Shared Latent Structure: Attempts to transfer knowledge across relational views (modalities) with constraints on the representations (e.g. (DualGCN) aligns embeddings across views).
Ensemble methods introduced in (relationalGCN) work on multi-relational edges in knowledge graphs. None of these approaches is directly suitable for our induced relationships. Our relational views utilize different label assignment semantics (label sharing within a clique vs. determining the label based on contrast within a clique). In our label contrast semantics, we must achieve feature discrimination and label inversion between contrasting vertices, as opposed to label homogeneity and feature sharing in the label sharing case. Thus, aggregating relationships by pooling, concatenation, or addition of vertex representations fails to capture the semantic heterogeneity of the induced views. Further, data induced relations are uncurated and inherently noisy. Directly aggregating the learned representations via Stacking or Fusion can lead to noise propagation. We also expect views of the same relation type to be correlated.
We thus propose the following approach to aggregate information across relation types and between views of a relation type.
Cross-relation Aggregation: We expect distinct relation types to perform well on different subsets of the set of $(q, a)$ tuples. We empirically verify this with the Jaccard overlap between the sets of misclassified vertices under each relational view of a relation type on our dataset. Given $M_A$ and $M_B$, the sets of $(q, a)$ tuples misclassified by GCNs $A$ and $B$ respectively, the Jaccard overlap is

$$J(M_A, M_B) = \frac{|M_A \cap M_B|}{|M_A \cup M_B|}$$

The values are as follows for the relational pairings: (Contrastive, TrueSkill Similarity) = 0.42, (Contrastive, Reflexive) = 0.44 and (Reflexive, TrueSkill Similarity) = 0.48. Relatively low values of the overlap metric indicate uncorrelated errors across the relations.
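The overlap statistic is straightforward to compute from the identifiers of the misclassified tuples. The following is a minimal sketch; the function name `jaccard_overlap` and the toy identifier sets are hypothetical, and returning 0 when both sets are empty is a convention assumed here.

```python
def jaccard_overlap(errors_a, errors_b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two sets of misclassified tuples."""
    a, b = set(errors_a), set(errors_b)
    union = a | b
    if not union:
        # Neither model made any error; define the overlap as 0 by convention.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical misclassified tuple ids for two relational views.
contrastive_errors = {1, 2, 3}
reflexive_errors = {2, 3, 4}
overlap = jaccard_overlap(contrastive_errors, reflexive_errors)
```

In this toy case two of the four distinct misclassified tuples are shared, so the overlap is 0.5.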
Gradient boosting techniques are known to improve performance when individual classifiers, including neural networks (ncboost), are diverse yet accurate. A natural solution then is to apply boosting to the set of relation types and bridge the weaknesses of each learner. We employ Adaboost (adaboost) to combine relation level scores, $H_R$ (eq. 16), in a weighted manner to compute the final boosted score, $H_{\mathcal{R}} \in \mathbb{R}^{N \times 1}$, representing all relation types (Line 12, algorithm 1). $Y \in \mathbb{R}^{N \times 1}$ denotes the acceptance label of all tuples. Note that an entry in $Y \odot H_R$ is positive when the accepted label of the corresponding $(q, a)$ tuple and the sign of the prediction score, $\mathrm{sign}(H_R)$, of relation type $R$ match, and negative otherwise. Thus, the weights $\alpha_R$ adapt to the ratio of correctly classified tuples to misclassified tuples for the relation $R$ (Line 9, algorithm 1). The precise score computation is described in algorithm 1. We use the polarity of each entry in the boosted score, $\mathrm{sign}(H_{\mathcal{R}}) \in \{-1, +1\}$, to predict the class label of the corresponding $(q, a)$ tuple. The final score is also used to create a ranked list among all the candidate answers $a$ for each question $q$. $L_{(q,a)}$ represents the position of candidate answer $a$ in the ranked list for question $q$.
Intra-relation Aggregation: Gradient boosting methods can effectively aggregate relation level representations, but are not optimal within a relation type (since they cannot capture shared commonalities between different views of a relation type). For instance, we should facilitate information sharing between the TrueSkill similarity and Arrival similarity views. Thus, if an answer is authored by a user with a higher skill rating and answered significantly earlier than other answers, its probability of being accepted should be mutually enhanced by both signals. Empirically, we also found the TrueSkill and Arrival Similarity GCNs to commit similar mistakes (Jaccard overlap = 0.66). Thus, intra-relation learning (within a single relation type like Similar Contrast) can benefit from sharing the structure of the latent spaces, i.e., the weight parameters of the GCNs.
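The cross-relation boosting described above can be illustrated with a minimal sketch of standard discrete AdaBoost over per-relation predictions. This is not the paper's Algorithm 1, which is not reproduced here: the function name `adaboost_combine`, the toy scores, and the choice to boost on the sign of each relation's score are assumptions made for illustration.

```python
import numpy as np


def adaboost_combine(labels, relation_scores):
    """Combine per-relation scores with standard discrete AdaBoost weights.

    labels: (N,) array with entries in {-1, +1}.
    relation_scores: dict mapping a relation name to an (N,) array of
        real-valued prediction scores (assumed input format).
    Returns the boosted score per tuple and the weight of each relation.
    """
    n = len(labels)
    sample_w = np.full(n, 1.0 / n)            # uniform tuple weights to start
    boosted = np.zeros(n)
    alphas = {}
    for name, scores in relation_scores.items():
        pred = np.where(scores >= 0, 1.0, -1.0)
        # Weighted error, clipped to keep the logarithm finite.
        err = np.clip(np.sum(sample_w[pred != labels]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # grows with correct/incorrect ratio
        alphas[name] = alpha
        boosted += alpha * pred
        # Up-weight misclassified tuples, down-weight correct ones.
        sample_w *= np.exp(-alpha * labels * pred)
        sample_w /= sample_w.sum()
    return boosted, alphas


# Hypothetical toy example: each relation misclassifies a different tuple.
labels = np.array([1.0, -1.0, 1.0, -1.0])
relation_scores = {
    "contrastive": np.array([0.9, -0.8, 0.7, 0.6]),
    "reflexive": np.array([0.5, -0.4, -0.3, -0.2]),
    "similar": np.array([-0.1, -0.6, 0.4, -0.3]),
}
boosted, alphas = adaboost_combine(labels, relation_scores)
predicted = np.sign(boosted)
```

In this toy setting each relation errs on a different tuple, which mirrors the low error overlap reported above, and the sign of the boosted score recovers all four labels even though no single relation does.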
Weight Sharing: For multiple views representing a relation type (e.g., TrueSkill and Arrival Similarity), we train a separate GCN for each view but share the layer-wise linear transforms to capture similarities in the learned latent spaces. Weight sharing is motivated by a similar idea explored to capture local and global views in (DualGCN). Although they share the same weight parameters, each GCN can still learn distinct vertex representations, as each view convolves over a different neighborhood and employs random dropout during training. We thus propose to use an alignment loss term to minimize the prediction difference between views of a single relation type (reg). The loss attempts to align the learned vertex representations at the last layer (the loss term aligns pairs of final vertex representations, $Z_i^K$ and $Z_{i'}^K$, for views $S_i$ and $S_{i'}$ of the same relation type). In principle, multiple GCNs augment the performance of the relation type by sharing prior knowledge through multiple adjacency matrices.
Training Algorithm: Algorithm 2 describes the training algorithm for our IRGCN model. For each epoch, we first compute the aggregated prediction score of our boosted model as described in algorithm 1. We use a supervised exponential loss for training, with elastic-net regularization (L1 loss and L2 loss) on the graph convolutional weight matrices for each view. Note that we employ weight sharing between all views of the same relation type so that only one set of weight matrices is learned per relation. The exponential loss for each relation type is added alternatingly to the boosted loss. We apply an exponential annealing schedule, $\lambda(t)$, i.e., a function of the training epochs ($t$), to the loss function of each relation. As training progresses and the boosted model learns to optimally distribute vertices among the relations, the increase in $\lambda(t)$ ensures more emphasis is provided to the individual convolutional networks of each relation. Figure 2 illustrates the overall architecture of our IRGCN model.
6. Experiments
In this section, we first describe our dataset, followed by our experimental setup: comparative baselines, evaluation metrics, and implementation details. We then present results across several experiments to evaluate the performance of our model on merging semantically diverse induced relations.
Table 1: Dataset statistics.

Category               Community    Questions  Answers  Users    Mean answers per question
Technology             ServerFault  61,873     181,974  140,676  2.94
Technology             AskUbuntu    41,192     119,248  200,208  2.89
Technology             Unix         9,207      33,980   84,026   3.69
Culture/Recreation     English      30,616     110,235  74,592   3.6
Culture/Recreation     Games        12,946     45,243   14,038   3.49
Culture/Recreation     Travel       6,782      20,766   23,304   3.06
Life/Arts              SciFi        14,974     49,651   33,754   3.31
Life/Arts              Home         8,022      23,956   30,698   2.99
Life/Arts              Academia     6,442      23,837   19,088   3.7
Science                Physics      23,932     65,800   52,505   2.75
Science                Maths        18,464     53,772   28,181   2.91
Science                Statistics   13,773     36,022   54,581   2.62
Professional/Business  Workplace    8,118      33,220   19,713   4.09
Professional/Business  Aviation     4,663      14,137   7,519    3.03
Professional/Business  Writing      2,932      12,009   6,918    4.10
6.1. Dataset
We evaluate our approach on multiple communities catering to different topics from a popular online Community Question Answer (CQA) platform, StackExchange (https://stackexchange.com/). The platform divides the communities into five different categories, i.e., Technology, Culture/Recreation, Life/Arts, Science and Professional. For our analysis, we collect data from the ten largest communities from each of the five categories until March 2019, resulting in a total of 50 StackExchange communities. In StackExchange, each questioner can mark a candidate answer as an "accepted" answer. We only consider questions with an accepted answer. Table 1 shows the final dataset statistics.
For each $(q, a)$ tuple, we compute the following basic features:
Activity features: View count of the question, number of comments for both question and answer, the difference between the posting times of question and answer, arrival rank of the answer (we assign rank 1 to the first posted answer) (TianZL13).
Text features: Paragraph and word count of question and answer, presence of a code snippet in question and answer (useful for programming based forums), word count in the question title.
User features: Word count in the About Me section of the user profile for both users: the one posting the question and the one posting the answer.
Time-dependent features like upvotes/downvotes of the answer, and user features like reputation or badges, used in earlier studies on StackExchange (BurelMA16), are problematic for two reasons. First, we only know the aggregate values, not how these values change with time. Second, since these values typically increase over time, it is unclear if an accepted answer received the votes prior to or after it was accepted. Thus, we do not use such time-dependent features for our model and the baselines in our experiments.
6.2. Experimental Setup
6.2.1. Baselines
We compare against stateoftheart featurebased baselines for answer selection and competing aggregation approaches to fuse diverse relational views of the dataset (DualGCN; relationalGCN).
Random Forest (RF) (BurelMA16; TianZL13) model trains on the feature set mentioned earlier for each dataset. This model is shown to be the most effective featurebased model for Answer Selection.
Feed-Forward network (FF) (JendersKN16) is used as a deep learning baseline to learn nonlinear transformations of the feature vectors for each tuple. This model is equivalent to our Reflexive GCN model in isolation.
Dual GCN (DGCN) (DualGCN) trains a separate GCN for each view. In addition to the supervised loss computed using training labels, they introduce a regularizer to minimize the mean squared error (MSE) between vertex representations of two views, thus aligning the learned latent spaces. The regularizer loss is similar to our intra-relation aggregation approach but assumes label and feature sharing across all the views.
Relational GCN (RGCN) (relationalGCN) combines the output representations of the previous layer of each view to compute an aggregated input to the current layer.
We also report results for each view individually: Contrastive (CGCN), Arrival Similarity (ASGCN), TrueSkill Similarity (TSGCN) and Reflexive (RGCN), alongside our proposed IRGCN model. We do not compare with other structure-based approaches to compute vertex representations (DeepWalk; node2vec; Planetoid; LINE) as GCN is shown to outperform them (gcn). We also compare, in Section 6.5, with the common aggregation strategies for merging neural representations discussed in Section 5.
6.2.2. Evaluation Metric
We randomly select 20% of the questions to be in the test set. All tuples belonging to these questions then comprise the set of test tuples, or vertices, $T$. The rest of the vertices, along with their label information, are used for training the model. We evaluate our model on two metrics, Accuracy and Mean Reciprocal Rank (MRR). The Accuracy metric is widely used in vertex classification literature, while MRR is popular for ranking problems like answer selection. Formally,
$$\mathrm{Acc} = \frac{1}{|T|} \sum_{(q,a) \in T} \mathbb{1}\left( y_{(q,a)} \cdot \hat{y}_{(q,a)} > 0 \right)$$
with $\cdot$ as the product and $\mathbb{1}(\cdot)$ as the indicator function. The product is positive if the accepted label $y_{(q,a)}$ and predicted label $\hat{y}_{(q,a)}$ match and negative otherwise.
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$
where $Q$ is the set of test questions and $\mathrm{rank}_q$ refers to the position of the accepted answer in the ranked list for question $q$ (Wang:2009).
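The two metrics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: labels are encoded as +1 (accepted) and -1 (not accepted), and the dictionary layout and function names are hypothetical choices for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of vertices whose true and predicted labels agree.

    Labels are assumed to be +1 / -1, so the product is positive
    exactly when the two labels match.
    """
    matches = sum(1 for t, p in zip(y_true, y_pred) if t * p > 0)
    return matches / len(y_true)


def mean_reciprocal_rank(scores_by_question, accepted_by_question):
    """Average of 1 / rank of the accepted answer over all questions.

    scores_by_question: {question_id: {answer_id: score}}
    accepted_by_question: {question_id: accepted answer_id}
    """
    total = 0.0
    for q, scores in scores_by_question.items():
        # Rank answers by descending model score; rank 1 is the top answer.
        ranked = sorted(scores, key=scores.get, reverse=True)
        rank = ranked.index(accepted_by_question[q]) + 1
        total += 1.0 / rank
    return total / len(scores_by_question)


# Illustrative values only.
acc = accuracy([1, -1, -1, 1], [1, -1, 1, 1])
mrr = mean_reciprocal_rank(
    {"q1": {"a1": 0.9, "a2": 0.1}, "q2": {"b1": 0.2, "b2": 0.7, "b3": 0.5}},
    {"q1": "a1", "q2": "b3"},
)
```

Accuracy counts every vertex, accepted or not, whereas MRR looks only at where the accepted answer lands in each question's ranking.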
Method  Technology  Culture/Recreation  Life/Arts  Science  Professional/Business  

Acc (%)  MRR  Acc (%)  MRR  Acc (%)  MRR  Acc (%)  MRR  Acc (%)  MRR
RF (BurelMA16; TianZL13)  0.023  0.043  0.018  0.050  0.049  0.089  0.024  0.049  0.044  0.081 
FF (JendersKN16)  0.027  0.022  0.020  0.023*  0.049  0.034  0.024  0.028  0.040  0.049 
DGCN (DualGCN)  0.022  0.017  0.017  0.028  0.034  0.038  0.023*  0.035  0.031  0.046 
RGCN (relationalGCN)  0.045  0.045  0.016  0.042  0.043  0.054  0.054  0.042  0.038  0.061 
ASGCN  0.032  0.015  0.021  0.025  0.048  0.042  0.045  0.028  0.045  0.047 
TSGCN  0.032  0.018  0.023  0.023  0.061  0.048  0.042  0.031  0.046  0.044 
CGCN  0.022*  0.015*  0.017*  0.024  0.034*  0.040*  0.042  0.032*  0.038*  0.034* 
IRGCN  0.023  0.014  0.018  0.025  0.032  0.037  0.021  0.028  0.026  0.032 

DGCN stands for DualGCN, RGCN stands for RelationalGCN, and IRGCN stands for Induced Relational GCN.
The * symbol marks the second-best performance among all models. The improvement of our model over the second-best model is statistically significant at the 0.01 level under a one-tailed paired t-test.
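For readers unfamiliar with the paired t-test mentioned in the table note, the sketch below computes the paired t statistic from matched scores of two models using only the Python standard library. The function name and the per-fold accuracies are made up for illustration and are not results from this paper. Turning the statistic into a one-tailed p-value additionally requires the Student t distribution with the returned degrees of freedom, which is not included here.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise
    differences scores_a[i] - scores_b[i].
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Hypothetical per-fold accuracies of two models on the same folds.
model_a = [0.74, 0.79, 0.82, 0.75, 0.81]
model_b = [0.71, 0.76, 0.79, 0.73, 0.77]
t_value, dof = paired_t_statistic(model_a, model_b)
```

A large positive statistic supports the one-tailed hypothesis that the first model scores higher than the second on matched folds.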
6.2.3. Implementation Details
We implemented our model and the baselines in PyTorch. We use the ADAM optimizer (ADAM) for training with 50% dropout to avoid overfitting. We use four hidden layers in each GCN with hidden dimensions 50, 10, 10, 5, respectively, and ReLU activation. The coefficients of and regularizers are set to and respectively. For TrueSkill Similarity, we use margin to create links, while for Arrival similarity, we use . We implement a mini-batch version of training for large graphs where each batch contains a set of questions and their associated answers. This is equivalent to training on the whole graph as we have disconnected cliques. All code and data will be released upon publication.
6.3. Performance Analysis
Table 2 shows impressive gains over state-of-the-art baselines for all five categories. We report mean results for each category obtained after 5-fold cross-validation on each of the communities. Our induced relational GCN model beats the best-performing baseline by 4-5% on average in accuracy. The improvement in MRR values is around 2.5-3% across all categories. Note that MRR is based only on the rank of the accepted answer, while accuracy is based on correct labeling of both accepted and non-accepted answers.
Among individual views, Contrastive GCN performs best on all the communities. It even beats the best-performing baseline, DualGCN, which uses all the relational views. Note that the contrastive view compares the candidate answers to a question and uses our proposed contrastive modification to the convolution operation. Arrival Similarity follows Contrastive and then Reflexive. The superior performance of the Arrival Similarity view shows that early answers tend to get accepted and vice versa. It indicates that users primarily use CQA forums for quick answers to their queries. Also, recall that Reflexive predicts each vertex's label independent of other answers to the same question. Thus, the competitive performance of the Reflexive strategy indicates that a vertex's features themselves are well predictive of the label. TrueSkill Similarity performs on par with or slightly worse than Reflexive. Figure 3 presents t-SNE distributions (sne) of the learned vertex representations of our model applied to Chemistry StackExchange from the Science category. Note that each view, including the two views under the Similar Contrast relation, learns a distinct vertex representation. Hence, all views are essential and contribute to our final performance.
Out of the baseline graph ensemble approaches, DualGCN performs significantly better than RelationalGCN by an average of around 26% for all categories. Recall that in the RelationalGCN model, the convolution output of each view is linearly combined to compute the final output. Linear combination works well for knowledge graphs as each view can be thought of as a feature, and then it accumulates information from each feature. DualGCN is similar to our approach: it trains a different GCN for each view and later merges their results. However, it enforces similarity in the vertex representations learned by each view. This restriction is not suitable for our induced relationships as they are semantically different (contrastive captures contrast in features vs. similarity enforces label sharing).
6.4. Ablation Study on Relation Types
Relation Type  Tech  Culture  Life  Sci  Business 

C  71.23  75.90  78.71  72.99  76.85 
{ TS, AS }  67.86  74.15  75.75  65.80  76.13 
R  68.30  73.35  76.57  67.40  75.76 
{TS, AS } + R  69.28  75.50  76.41  70.11  77.90 
C + R  73.04  77.66  80.25  73.72  80.04 
C + { TS, AS }  72.81  78.04  81.41  72.19  80.15 
C + { TS, AS } + R  73.87  78.74  81.60  74.68  80.56 
We present results of an ablation study with different combinations of relation types (Contrastive, Similar and Reflexive) used for the IRGCN model in Table 3. We conducted this study on the biggest community from each of the five categories, i.e., ServerFault (Technology), English (Culture), Science Fiction (Life), Physics (Science), Workplace (Business). The Similar Contrast relation (TrueSkill and Arrival) used in isolation performs the worst among all the variants. Training the Contrastive and Similar Contrast relations together in our boosted framework performs similarly to our final model. Reflexive GCN contributes the least as it does not consider any neighbors.
6.5. Aggregator Architecture Variants
We compare our gradient-boosting-based aggregation approach with other popular methods used in the literature to merge different neural networks, discussed in Section 5.
Method  Tech  Culture  Life  Sci  Business 

Stacking (Stacking)  68.58  74.44  79.19  70.29  75.50 
Fusion (Fusion18)  72.30  77.25  80.79  73.91  79.01 
NeighborAgg (graphsage; relationalGCN)  69.29  74.28  77.94  68.42  78.64 
IRGCN  73.87  78.74  81.60  74.78  80.56 
Table 4 reports the accuracy results for these aggregator variants as compared to our model. Our method outperforms all the variants, with Fusion being the strongest of them. This reaffirms that existing aggregation models are not suitable for our problem. Note that all of these approaches except Fusion perform worse than even Contrastive GCN. The Fusion approach performs similarly to our approach but is computationally expensive as the input size for each IRGCN is linear in the number of all views in the model.
6.6. Discriminative Magnification effect
We show that our proposed modification to the convolution operation for the contrastive view achieves the Discriminative Magnification effect (eq. 7). Note that the difference is scaled by the clique size, i.e., the number of answers to a question. Figure 4 shows the accuracy of our IRGCN model as compared to the Feed-Forward model with varying clique size. Recall that the Feed-Forward model predicts node labels independent of other nodes and is not affected by clique size. We report average results over the same five communities as above. We can observe that the increase in accuracy is much larger for smaller clique sizes (13% improvement for and 4% for on average). The results are almost similar for larger clique sizes. In other words, our model significantly outperforms the Feed-Forward model for questions with fewer candidate answers. Since around 80% of the questions have very few answers, this gain over FF is significant.
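The per-clique-size comparison described above can be reproduced with a simple grouping step. The sketch below is a minimal illustration, assuming labels encoded as +1/-1 and one list of labels per question; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def accuracy_by_clique_size(labels_by_question, preds_by_question):
    """Group vertex-level accuracy by clique size (answers per question).

    Both arguments map a question id to a list of +1 / -1 labels,
    one entry per candidate answer, in the same order.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, labels in labels_by_question.items():
        size = len(labels)  # clique size = number of answers to q
        for y, y_hat in zip(labels, preds_by_question[q]):
            total[size] += 1
            correct[size] += int(y * y_hat > 0)
    return {size: correct[size] / total[size] for size in sorted(total)}


# Illustrative labels and predictions only.
labels = {"q1": [1, -1], "q2": [-1, 1], "q3": [1, -1, -1]}
preds = {"q1": [1, -1], "q2": [1, -1], "q3": [1, -1, 1]}
by_size = accuracy_by_clique_size(labels, preds)
```

Running the same function on the predictions of two models and subtracting the results per size gives the kind of gap that Figure 4 plots.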
6.7. Label Sparsity
Graph Convolution Networks are robust to label sparsity as they exploit graph structure and are thus heavily used in semi-supervised settings. Figure 5 shows the change in accuracy for Physics StackExchange from the Science category at different training label rates. Even though our graph contains disconnected cliques, IRGCN still preserves robustness to label sparsity. In contrast, the accuracy of the Feed-Forward model declines sharply with less label information. The performance of DualGCN remains relatively stable, while Relational GCN's performance increases with a decrease in label rate. Relational GCN assumes each view to be a similarity relation, and thus adding the contrastive relation introduces noise into the model. However, as the training labels become extremely sparse, the training noise decreases, which leads to a marked improvement in the model. At the extremely low label rate of 0.01%, all approaches converge to the same value, which is the expectation of theoretically random selection. We obtained similar results for the other four StackExchange communities but omitted them for brevity.
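To make the random-selection reference point concrete, the sketch below computes closed-form expectations for a single question with k candidate answers. It rests on an assumption about what "random selection" means: exactly one answer is picked uniformly at random as accepted and the rest are labeled not accepted. Under that reading, all k labels are correct with probability 1/k, and otherwise k - 2 of the k labels are correct. The function names are hypothetical.

```python
from fractions import Fraction


def expected_random_accuracy(k):
    """Expected vertex-level accuracy for one question with k answers.

    Assumes one answer is chosen uniformly at random as 'accepted'.
    """
    hit = Fraction(1, k)  # probability the random pick is the accepted answer
    return hit + (1 - hit) * Fraction(k - 2, k)


def expected_random_mrr(k):
    """Expected reciprocal rank when the k answers are ranked uniformly at random."""
    return sum(Fraction(1, r) for r in range(1, k + 1)) / k


acc_2 = expected_random_accuracy(2)
acc_3 = expected_random_accuracy(3)
mrr_3 = expected_random_mrr(3)
```

Averaging these per-question values over the clique sizes observed in a community gives the chance level for that community under the stated assumption.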
6.8. Limitations
We do recognize certain limitations of our work. First, we do not deal with content in our model. Our focus in this work is to exploit structural properties between tuples. We believe that adding content will further improve our results. Second, we focus on equivalence relations that induce a graph comprising cliques. While cliques are useful graph objects for answer selection, equivalence relations may be too restrictive for other problems (e.g., the relation is not transitive). However, our modular framework does apply to arbitrary graphs, except that Equation 3 will no longer be an exact convolution but be an approximation.
7. Related Work
Our work intersects two research areas: Answer Selection and handling multi-relational social data, primarily via Graph Convolution.
Answer Selection: In CQA forums, previous answer selection literature includes feature-driven models and deep text models.
Feature-Driven Models (BurelMA16; JendersKN16; TianZL13; TianL16) in CQA identify and incorporate user features, content features, and thread features, e.g., in tree-based models (BurelMA16; JendersKN16; TianZL13), to identify the best answer. TianZL13 found that the best answer tends to be early and novel, with more details and comments. JendersKN16 trained classifiers for online forums, while BurelMA16 emphasize the Q&A thread structure.
Deep Text Models (ZhangLSW17; WuWS18; WangN15; SukhbaatarSWF15) learn optimal QA text-pair representations to select the best answer. FengXGWZ15 augment CNNs with discontinuous convolution for improved representations; WangN15 use stacked biLSTMs to match question-answer semantics.
Multi-Relational Approaches: While text and feature models treat answer content independently, we focus on integrating multi-relational social aspects in the prediction. We identify a few related threads: adversarial approaches to integrate social neighbor data (adv_social; adv_neighbor) and meta-learning to adapt across data modalities or tasks (maml). Different from these directions, we focus on the simplicity of our multi-relational graph formulation.
Graph Convolution is applied in spatial and spectral domains to compute graph node representations for downstream tasks including node classification (gcn), link prediction (relationalGCN), multi-relational tasks (rase), etc. Spatial approaches employ random walks or k-hop neighborhoods to compute node representations (DeepWalk; node2vec; Planetoid; LINE), while fast localized convolutions are applied in the spectral domain (deferrard; duvenaund). Our work is inspired by Graph Convolution Networks (GCN) (gcn), which outperforms spatial convolutions and scales to large graphs. GCN extensions have been proposed for signed networks (signedgcn), inductive settings (graphsage), multiple relations (DualGCN; relationalGCN) and evolution (dysat). However, GCN variants assume label sharing, which cannot model contrastive relations in our setting.
8. Conclusion
This paper addressed the question of identifying the accepted answer to a question in CQA forums. We developed a novel induced relational graph convolutional (IRGCN) framework to address this question. We made three contributions. First, we introduced a novel idea of using strategies to induce different views on tuples in CQA forums. Each view consists of cliques and encodes one of three relation types: reflexive, similar, or contrastive. Second, we encoded label sharing and label contrast mechanisms within each clique through a GCN architecture. Our novel contrastive architecture achieves Discriminative Magnification between nodes. Finally, we showed through extensive empirical results on StackExchange that boosting techniques improve learning in our convolutional model. Our ablation studies show that the contrastive relation is the most effective individually on StackExchange. As part of future work, we plan to include content in the convolutional framework.