Syndrome-aware Herb Recommendation with Multi-Graph Convolution Network

by   Yuanyuan Jin, et al.
East China Normal University

Herb recommendation plays a crucial role in the therapeutic process of Traditional Chinese Medicine (TCM), which aims to recommend a set of herbs to treat the symptoms of a patient. While several machine learning methods have been developed for herb recommendation, they are limited in that they model only the interactions between herbs and symptoms and ignore the intermediate process of syndrome induction. When performing TCM diagnostics, an experienced doctor typically induces syndromes from the patient's symptoms and then suggests herbs based on the induced syndromes. As such, we believe the induction of syndromes, an overall description of the symptoms, is important for herb recommendation and should be properly handled. However, due to the ambiguity and complexity of syndrome induction, most prescriptions lack the explicit ground truth of syndromes. In this paper, we propose a new method that takes the implicit syndrome induction process into account for herb recommendation. Given a set of symptoms to treat, we aim to generate an overall syndrome representation by effectively fusing the embeddings of all the symptoms in the set, to mimic how a doctor induces the syndromes. Towards symptom embedding learning, we additionally construct a symptom-symptom graph from the input prescriptions for capturing the relations between symptoms; we then build graph convolution networks (GCNs) on both symptom-symptom and symptom-herb graphs to learn symptom embedding. Similarly, we construct a herb-herb graph and build GCNs on both herb-herb and symptom-herb graphs to learn herb embedding, which finally interacts with the syndrome representation to predict the scores of herbs. In this way, more comprehensive representations can be obtained. We conduct extensive experiments on a public TCM dataset, showing significant improvements over state-of-the-art herb recommendation methods.




I Introduction

As an ancient and holistic treatment system established over thousands of years, Traditional Chinese Medicine (TCM) plays an essential role in Chinese society [cheung2011tcm]. TCM theory is grounded in holistic thinking, which emphasizes the integrity of the human body and its interrelationship with natural environments [Wang2014Zheng]. Fig. 1 takes the classic Guipi Decoction prescription as an example to show the three-step therapeutic process in TCM: (1) Symptom Collection. The doctor examines the symptoms of the patient. Here the symptom set contains “night sweat”, “pale tongue”, “small and weak pulse” and “amnesia”. (2) Syndrome Induction. Corresponding syndromes are determined after an overall analysis of symptoms. In this case, the main syndrome is “deficiency of both spleen and blood” in a solid circle. As “pale tongue” and “small and weak pulse” can also appear under “the spleen fails to govern blood”, there is also an optional syndrome called “the spleen fails to govern blood” in a dotted circle. (3) Treatment Determination. The doctor chooses a set of herbs as the medicine to cure the syndromes. The compatibility of herbs is also considered in this step. Here the herb set consists of “ginseng”, “longan aril”, “angelica sinensis” and “tuckahoe”. As we can see, the second step of syndrome induction, which systematically summarizes the symptoms, is critical for the final recommendation of herbs. However, as the above example shows, a symptom can appear in various syndromes, which makes syndrome induction ambiguous and complex [TCM]. In fact, for a given symptom set, different TCM doctors might give different syndrome sets (as shown in Fig. 1), and thus no standard ground truth exists.

Fig. 1: An example of the therapeutic process in TCM.

In a TCM prescription corpus, each data instance contains two parts, a set of symptoms and a set of herbs, which means the herb set can well cure the symptom set. To generalize to unseen symptom sets, the herb recommendation task focuses on modeling the interactions between symptoms and herbs, which is analogous to the traditional recommendation task that models the interactions between users and items [wang2019neural]. Notably, one key difference is that in traditional recommendation, the prediction is mostly performed on the level of a single user, whereas in herb recommendation, we need to jointly consider a set of symptoms to make a prediction. Due to the lack of ground truth syndromes, existing efforts on herb recommendation [Yao2018ATM, ji2017latent, Ruan2019DiscoveringRF, ruan2019exploring] treat the syndrome concept as latent topics. However, they only learn the latent syndrome topic distribution given a single symptom. In particular, they focus on modeling the interaction between one symptom and one herb, and then aggregate the interactions from multiple symptoms to rank the herbs. As such, the set information of symptoms is overlooked.

In this paper, we propose to incorporate the implicit syndrome induction process into herb recommendation, which conforms to the intuition that syndrome induction is a key step to summarize the symptoms towards making effective herb recommendations, as shown in Fig. 1. Specifically, given a set of symptoms to treat, we aim to learn an overall implicit syndrome representation based on the constituent symptoms, before interacting with herbs for recommendation generation. Through this manner, the prescription behavior of doctors could be mimicked.

To this end, we propose a new method named Syndrome-aware Multi-Graph Convolution Network (SMGCN), a multi-layer neural network model that performs interaction modeling between syndromes and herbs for the herb recommendation task. In the interaction modeling component (top layer of SMGCN), we first fuse the embeddings of the symptoms in a target symptom set via a Multi-Layer Perceptron (MLP) to directly obtain the overall implicit syndrome representation, which later interacts with herb embeddings to output prediction scores. In the embedding learning component (bottom layers of SMGCN), we learn symptom embedding and herb embedding via GCN on multiple graphs. Specifically, in addition to the input symptom-herb graph, we further build symptom-symptom and herb-herb graphs based on the co-occurrence of symptoms (herbs) in prescription entries. Intuitively, some symptoms frequently co-occur in patients (e.g., nausea and vomiting), and modeling this is beneficial to symptom representation learning; similarly, the herb-herb graph captures frequently co-occurring herbs, which is useful for encoding their compatibility. We conduct experiments on a public TCM dataset [Yao2018ATM], demonstrating the effectiveness of our SMGCN method as a whole and validating the rationality of each purposeful design.

The main contributions of this work are as follows.

  • We highlight the importance of representing syndromes and modeling the interactions between syndromes and herbs for herb recommendation.

  • We propose SMGCN, which unifies the strengths of MLP in fusion modeling (i.e., fusing symptom embeddings into the overall implicit syndrome embedding) and GCN in relational data learning (i.e., learning symptom and herb embeddings) for herb recommendation.

  • We build herb-herb and symptom-symptom graphs to enrich the relations of herbs and symptoms, and extend GCN to multiple graphs to improve their representation learning quality.

The rest of the paper is organized as follows: Section 2 describes the problem definition. Section 3 describes our overall framework. Section 4 introduces our proposed method. Section 5 evaluates our method. Section 6 surveys the related work. Finally, Section 7 provides some concluding remarks.

II Problem Definition

The task of herb recommendation aims to generate a herb set as the treatment for a specific symptom set. Herb recommender systems usually learn from a large prescription corpus. Let $S = \{s_1, s_2, \ldots, s_M\}$ and $H = \{h_1, h_2, \ldots, h_N\}$ denote all symptoms and herbs, respectively. Each prescription consists of a symptom set and a herb set, e.g., $p = \langle \{s_1, s_2, \ldots\}, \{h_1, h_2, \ldots\} \rangle$. In the syndrome induction process, an overall syndrome representation needs to be induced for each symptom set, which is later used to generate an appropriate herb set. Hereafter we represent the symptom set and herb set by $sc = \{s_1, s_2, \ldots\}$ and $hc = \{h_1, h_2, \ldots\}$, respectively. In this way, each prescription is denoted by $p = \langle sc, hc \rangle$.

Given a symptom set $sc$, our task is to compute an N-dimensional probability vector, where the value of dimension $i$ represents the probability that herb $h_i$ can cure $sc$. This is achieved by a learned prediction function $\hat{y}_{sc} = g(sc, H; \theta)$, where $\hat{y}_{sc}$ represents the probability vector, and $\theta$ indicates the trainable parameters of function $g$. The input and output are defined as follows:

  • Input: Herbs $H$, Symptoms $S$, Prescriptions $P$.

  • Output: A learned function $g(sc, H; \theta)$, which generates the probability vector $\hat{y}_{sc}$ for all herbs from $H$ given the symptom set $sc$.

Fig. 2: The overall architecture of our proposed model (including Bipar-GCN, Synergy Graph Encoding (SGE) and Syndrome Induction (SI) ). Symptom nodes are in blue and herb nodes are in gray. Notably, the nodes with oblique lines are target nodes.

III Overview of Proposed Approach

In this section, we discuss the proposed Syndrome-aware Multi-Graph Convolution Network framework in detail, which is depicted in Fig. 2. Our proposed model takes a symptom set $sc$ and all herbs $H$ as input, and outputs the predicted probability vector $\hat{y}_{sc}$ in dimension $|H|$. In $\hat{y}_{sc}$, the value at position $i$ indicates the probability that $h_i$ is appropriate to cure $sc$.

To complete this task, the model mainly consists of two layers: the Multi-Graph Embedding Layer and the Syndrome-aware Prediction Layer.

Multi-Graph Embedding Layer. This layer aims to obtain expressive representations for all symptoms from $S$ and all herbs from $H$. Considering the complex interrelations between symptoms and herbs in TCM, we first develop a Bipartite Graph Convolutional Neural Network (Bipar-GCN) to process the bipartite symptom-herb graph. To capture the intrinsic difference between symptoms and herbs, Bipar-GCN performs symptom-oriented embedding propagation for the target symptom node, and herb-oriented embedding propagation for the target herb node, respectively. In this way, symptom embedding $b_s$ and herb embedding $b_h$ are learned. Second, a Synergy Graph Encoding (SGE) component is introduced to encode the synergy information of symptom pairs and herb pairs. For symptom pairs, it constructs a symptom-symptom graph according to the co-occurrence frequency of symptom pairs and performs graph convolution on the symptom-symptom graph to learn symptom embedding $r_s$. Analogously, it also learns herb embedding $r_h$ from a herb-herb graph. Third, for each symptom (herb), the two types of embeddings $b_s$ ($b_h$) and $r_s$ ($r_h$) from the Bipar-GCN and SGE are fused to form the integrated embedding $e_s^*$ ($e_h^*$).

Syndrome-aware Prediction Layer. In this layer, bearing the importance of the syndrome induction process in mind, the Syndrome Induction (SI) component first feeds the embeddings of all symptoms in the symptom set $sc$ into an MLP to generate the overall syndrome representation $e_{syndrome}(sc)$. Second, all herb embeddings are stacked into $e_H$, an $|H| \times d$ matrix where $d$ is the dimension of each herb embedding. The syndrome representation $e_{syndrome}(sc)$ interacts with $e_H$ to predict $\hat{y}_{sc}$, the probability score vector for all herbs from $H$.

Considering that a set of herbs will be recommended as a whole, a multi-label loss function is utilized to optimize our proposed model. All notations used in this paper are summarized in Tab. I.


initial embeddings for herbs, symptoms
symptom collection and herb collection
collection of symptom sets and collection of herb sets
prescription collection
neighborhood of symptom, herb on the bipartite graph
symptom-herb graph
symptom-symptom graph and herb-herb graph
threshold for constructing SS and HH
neighborhood of symptom on SS
neighborhood of herb on HH
message construction function for symptom, herb at k-th Bipar-GCN layer
message aggregation function for symptom, herb at k-th Bipar-GCN layer
aggregation function for symptom on SS
aggregation function for herb on HH
symptom, herb neighborhood embedding at k-th Bipar-GCN layer
symptom, herb output embeddings at k-th Bipar-GCN layer
symptom output embeddings on SS
herb output embeddings on HH
herb, symptom final embedding after fusion
the MLP weight matrix and bias parameter used in Syndrome Induction
the attention network parameters in HeteGCN
the induced syndrome embedding for symptom set sc
the predicted probability vector for sc, in dimension |H|
TABLE I: Summary of all notations

IV Methodologies

IV-A Bipartite Graph Convolution Network

Recent works like [wang2019neural] have demonstrated the convincing performance of performing graph convolutions on the user-item graph in recommender systems. Despite their effectiveness, we argue that they ignore the intrinsic difference between the two types of nodes (users and items) in the bipartite graph and employ a shared aggregation and transformation function across the graph, which may restrict the flexibility of information propagation and affect the embedding expressiveness to some extent. To model the intrinsic difference between herbs and symptoms, we leverage Bipar-GCN, which is shown in Fig. 3. When the type of the target node is “symptom”, the left Symptom-oriented GCN will be used to obtain the representation for this target node. Otherwise, the right Herb-oriented GCN is adopted to learn the node embedding. These two parts share the same topological structure of the symptom-herb graph but adopt different aggregation and transformation functions. In this way, different types of nodes can develop their own propagation flexibility and therefore learn more expressive representations. Next we will introduce Bipar-GCN in detail.

IV-A1 Symptom-Herb Graph Construction

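Taking a TCM prescription $p = \langle sc = \{s_1, s_2, \ldots\}, hc = \{h_1, h_2, \ldots\} \rangle$ as an example, symptoms and herbs in the same prescription are related to each other. Therefore, the pairs $\{(s, h) \mid s \in sc, h \in hc\}$ constitute graph edges. We take the symptom-herb graph as an undirected graph, which is formulated as follows:

$$ SH_{s,h} = \begin{cases} 1, & \text{if } s \text{ and } h \text{ co-occur in some prescription}; \\ 0, & \text{otherwise}, \end{cases} $$

wherein $SH$ indicates the symptom-herb graph.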
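The construction above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `build_symptom_herb_graph` and the toy prescriptions are assumptions made for the example, and the graph is stored as two neighbor dictionaries instead of a dense matrix.

```python
from collections import defaultdict


def build_symptom_herb_graph(prescriptions):
    """Link every symptom to every herb that shares a prescription with it.

    `prescriptions` is an iterable of (symptom_set, herb_set) pairs.
    Returns two neighbor maps so the undirected graph can be read from
    either side.
    """
    herbs_of = defaultdict(set)      # symptom -> neighboring herbs
    symptoms_of = defaultdict(set)   # herb -> neighboring symptoms
    for symptoms, herbs in prescriptions:
        for s in symptoms:
            for h in herbs:
                herbs_of[s].add(h)
                symptoms_of[h].add(s)
    return dict(herbs_of), dict(symptoms_of)


# Hypothetical toy corpus, not taken from the dataset.
toy = [
    ({"night sweat", "amnesia"}, {"ginseng", "longan aril"}),
    ({"amnesia", "pale tongue"}, {"tuckahoe"}),
]
herbs_of, symptoms_of = build_symptom_herb_graph(toy)
```

Because edges are only added inside a single prescription, "night sweat" is not linked to "tuckahoe" in this toy corpus even though both are connected to "amnesia".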

Fig. 3: Bipartite GCN. Blue edges and gray edges denote different graph convolution functions. The nodes with oblique lines are target nodes.

IV-A2 Message Construction

In order to propagate information from each neighbor node to the target node, there are two operations to be defined: how to generate information that each node transfers to the target node and how to merge multiple neighbor messages together.

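For symptom $s$, the message that its one-hop neighbor herb $h$ transfers to it is defined as $m_{s \leftarrow h}$,

$$ m_{s \leftarrow h} = T_s^1 e_h, \tag{1} $$

where $e_h$ is the initial embedding of herb $h$, and $T_s^1$ is the transformation weight matrix of the first layer (symptom). After collecting messages from all neighbors, we choose the average operation to merge them, which is defined as follows,

$$ b_{N_s}^0 = \tanh\Big( \frac{1}{|N_s|} \sum_{h \in N_s} m_{s \leftarrow h} \Big), \tag{2} $$

where $N_s$ is the one-hop neighbor set of $s$ and we choose $\tanh$ as the activation function. Analogously, for herb $h$, the merged one-hop neighbor message can be represented by,

$$ b_{N_h}^0 = \tanh\Big( \frac{1}{|N_h|} \sum_{s \in N_h} m_{h \leftarrow s} \Big), \tag{3} $$

where $m_{h \leftarrow s} = T_h^1 e_s$.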

IV-A3 Message Aggregation

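After receiving the merged neighbor representation, the next step is to update the embedding of the target node. Here we adopt the GraphSAGE aggregator proposed in [Hamilton2017InductiveRL], which concatenates the two representations and then applies a nonlinear activation. The first-layer symptom representation $b_s^1$ and herb representation $b_h^1$ are defined as follows,

$$ b_s^1 = \tanh\big( W_s^1 ( e_s \,\|\, b_{N_s}^0 ) \big), \qquad b_h^1 = \tanh\big( W_h^1 ( e_h \,\|\, b_{N_h}^0 ) \big), \tag{4} $$

where $e_s$ and $e_h$ are the initial embeddings, $b_{N_s}^0$ and $b_{N_h}^0$ are the merged neighbor representations, and $\|$ indicates the concatenation operation of two vectors. $W_s^1$ and $W_h^1$ denote the aggregation weight matrices for symptoms and herbs, respectively.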
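As a rough illustration of one symptom-oriented propagation step (message construction followed by concatenation-based aggregation), the following pure-Python sketch uses tiny hand-picked weights. The helper names (`matvec`, `merge_neighbors`, `aggregate`), the use of tanh in both steps, and all numeric values are assumptions for the example, not the authors' implementation; the herb-oriented step would be identical apart from using its own weight matrices.

```python
import math


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def merge_neighbors(neighbor_embs, T):
    """Transform each neighbor embedding with T, average, then apply tanh."""
    messages = [matvec(T, e) for e in neighbor_embs]
    count = len(messages)
    return [math.tanh(sum(col) / count) for col in zip(*messages)]


def aggregate(self_emb, merged, W):
    """Concatenate target and neighborhood vectors, transform, apply tanh."""
    return [math.tanh(v) for v in matvec(W, self_emb + merged)]


# Toy 2-d example for a symptom with two herb neighbors.
T_s = [[1.0, 0.0], [0.0, 1.0]]                      # message weights
W_s = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]  # aggregation weights
e_s = [0.2, -0.4]
herb_neighbors = [[1.0, 0.0], [0.0, 1.0]]

b_Ns = merge_neighbors(herb_neighbors, T_s)
b_s = aggregate(e_s, b_Ns, W_s)
```

Keeping separate `T` and `W` matrices per node type is what distinguishes the symptom-oriented and herb-oriented branches while both share the same graph structure.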

IV-A4 High-order Propagation

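We can further extend the one-hop propagation rule to multiple layers. Specifically, in the $k$-th layer, we recursively formulate the representation of herb $h$ as,

$$ b_h^k = \tanh\big( W_h^k ( b_h^{k-1} \,\|\, b_{N_h}^{k-1} ) \big), $$

wherein the message from neighbors in the $k$-th layer for $h$ is defined as follows,

$$ b_{N_h}^{k-1} = \tanh\Big( \frac{1}{|N_h|} \sum_{s \in N_h} T_h^k b_s^{k-1} \Big). $$

For symptom $s$, the formulations are similar,

$$ b_s^k = \tanh\big( W_s^k ( b_s^{k-1} \,\|\, b_{N_s}^{k-1} ) \big), $$

wherein the message propagated within the $k$-th layer for $s$ is defined as follows,

$$ b_{N_s}^{k-1} = \tanh\Big( \frac{1}{|N_s|} \sum_{h \in N_s} T_s^k b_h^{k-1} \Big). $$

Here $b_s^0 = e_s$ and $b_h^0 = e_h$ are the initial embeddings, so that $k = 1$ recovers the one-hop rule.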
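To see why stacking layers matters, the sketch below runs the recursion on a toy graph with one herb and two symptoms and checks how far information travels. It is an illustrative assumption-laden example: the function names (`step`, `bipar_gcn`, `s1_after`), the 1-d embeddings, and the shared toy weights are all hypothetical, and every node is assumed to have at least one neighbor.

```python
import math


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def step(targets, sources, neighbors, T, W):
    """Update every target node from the previous-layer source embeddings."""
    updated = {}
    for node, emb in targets.items():
        msgs = [matvec(T, sources[n]) for n in neighbors[node]]
        merged = [math.tanh(sum(col) / len(msgs)) for col in zip(*msgs)]
        updated[node] = [math.tanh(v) for v in matvec(W, emb + merged)]
    return updated


def bipar_gcn(sym, herb, herbs_of, symptoms_of, layers):
    """Apply one (T_s, W_s, T_h, W_h) tuple per layer to both node types."""
    for T_s, W_s, T_h, W_h in layers:
        # Both updates read the embeddings of the previous layer.
        sym, herb = (step(sym, herb, herbs_of, T_s, W_s),
                     step(herb, sym, symptoms_of, T_h, W_h))
    return sym, herb


herbs_of = {"s1": ["h1"], "s2": ["h1"]}
symptoms_of = {"h1": ["s1", "s2"]}
layer = ([[1.0]], [[1.0, 1.0]], [[1.0]], [[1.0, 1.0]])


def s1_after(depth, s2_value):
    """Embedding of s1 after `depth` layers, for a given initial s2 value."""
    sym = {"s1": [0.0], "s2": [s2_value]}
    herb = {"h1": [1.0]}
    out, _ = bipar_gcn(sym, herb, herbs_of, symptoms_of, [layer] * depth)
    return out["s1"][0]
```

With one layer, `s1_after(1, x)` does not depend on `x`; with two layers it does, because the second-order neighbor `s2` reaches `s1` through the shared herb `h1`.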


IV-B Synergy Graph Encoding Layer

Besides the symptom-herb relation, there are also synergy patterns within symptoms and within herbs. Given a prescription $p = \langle sc, hc \rangle$, the symptoms in $sc$ are not independent but related to each other, and the herbs in $hc$ also influence each other and form a complete composition. As such, these relations can be exploited to construct synergy graphs for symptoms and herbs, respectively. It is worth noting that although second-order information propagation on the symptom-herb graph can capture homogeneous relations among symptoms and among herbs, the second-order symptom-symptom and herb-herb links are not equal to the concurrent pairs in prescriptions. For example, when two prescriptions share a herb, the symptoms of one prescription are second-order neighbors of the symptoms of the other via that herb, even if they never appear together in the same prescription; in that case the synergy graphs contain no edges between them. On the other hand, it is obvious that the bipartite symptom-herb graph cannot be directly derived from the homogeneous synergy graphs. Consequently, we conclude that the symptom-herb graph and the synergy graphs have their own characteristics and can complement each other to learn more accurate node representations.

IV-B1 Synergy Graphs Construction

Generally, herb and symptom synergy patterns are reflected by high co-occurrence frequencies. Taking the construction of the herb-herb graph as an example, we first compute the frequency of all herb-herb pairs in prescriptions: if herb $h_m$ and herb $h_n$ co-occur in the same $hc$, the frequency of pair $(h_m, h_n)$ is increased by 1. After obtaining the herb-herb frequency matrix, we manually set a threshold $x_h$ to filter the entries. For pairs co-occurring more than $x_h$ times, the corresponding entries are set to 1, and 0 otherwise. It is formulated as follows,

$$ HH_{h_m, h_n} = \begin{cases} 1, & \text{if } \mathrm{frequency}(h_m, h_n) > x_h; \\ 0, & \text{otherwise}, \end{cases} $$

wherein $HH$ denotes the herb-herb graph and $x_h$ is the threshold for herb-herb pairs. Following the same procedure, the symptom-symptom graph $SS$ can be constructed with its own threshold $x_s$.
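The thresholded co-occurrence construction can be sketched as follows. The function name `build_synergy_graph` and the toy herb sets are assumptions for illustration; the same function would serve for symptom sets with a different threshold.

```python
from collections import Counter
from itertools import combinations


def build_synergy_graph(item_sets, threshold):
    """Return the undirected edges whose co-occurrence count exceeds threshold.

    Each edge is stored once as a lexicographically sorted pair.
    """
    counts = Counter()
    for items in item_sets:
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return {pair for pair, c in counts.items() if c > threshold}


# Hypothetical herb sets, one per prescription.
herb_sets = [
    ["ginseng", "longan aril", "tuckahoe"],
    ["ginseng", "longan aril"],
    ["ginseng", "tuckahoe"],
    ["ginseng", "longan aril"],
]
edges = build_synergy_graph(herb_sets, threshold=2)
```

Here only the pair ("ginseng", "longan aril") co-occurs more than twice, so it is the single surviving edge; lowering the threshold to 1 would also keep ("ginseng", "tuckahoe").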

IV-B2 Information Propagation

Given the constructed herb-herb graph $HH$ and symptom-symptom graph $SS$, we apply a one-layer graph convolution network to generate the symptom and herb embeddings:

$$ r_s = \tanh\Big( \sum_{k \in N_s^{SS}} V_s e_k \Big), \qquad r_h = \tanh\Big( \sum_{q \in N_h^{HH}} V_h e_q \Big), $$

wherein $e_k$ and $e_q$ are the initial embeddings of a neighboring symptom $k$ and a neighboring herb $q$, respectively. $N_s^{SS}$ indicates the neighbor set of $s$ in $SS$, and $N_h^{HH}$ indicates the neighbor set of $h$ in $HH$. $V_s$ and $V_h$ are weight parameters for $SS$ and $HH$, respectively. According to our own degree statistics, the average node degrees show that the symptom-herb graph is much denser than the synergy graphs, and the standard deviations verify that the degree distributions of the synergy graphs are smoother than that of the symptom-herb graph. Considering that we need to fuse the Bipar-GCN embeddings $b$ and the synergy embeddings $r$ later, the sum aggregator is adopted for the synergy graphs to make these two parts more balanced, which can benefit the training process to some extent.

From the view of the herb recommendation task, $r_s$ and $r_h$ encode the synergy patterns in TCM, which further help improve the representation quality for symptoms and herbs. Besides, introducing additional information helps relieve the data sparsity problem [Ruan2019DiscoveringRF] of TCM prescriptions to some extent.

IV-C Information Fusion

Up to now we have obtained two types of embeddings for each node: $b$ from Bipar-GCN and $r$ from the synergy graphs. We employ the simple addition operation to merge these embeddings,

$$ e_s^* = b_s + r_s, \qquad e_h^* = b_h + r_h, $$

wherein $e_s^*$ and $e_h^*$ are the merged embeddings for symptom $s$ and herb $h$, respectively.
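A small sketch of the synergy-graph step and the fusion step may help connect the two. It assumes a tanh activation after the sum aggregator and uses hypothetical names (`sum_aggregate`, `fuse`) and toy values; it is not the authors' implementation.

```python
import math


def sum_aggregate(neighbor_embs, V):
    """Sum the neighbor embeddings, project with V, then apply tanh."""
    summed = [sum(col) for col in zip(*neighbor_embs)]
    projected = [sum(w * x for w, x in zip(row, summed)) for row in V]
    return [math.tanh(v) for v in projected]


def fuse(bipar_emb, synergy_emb):
    """Element-wise addition of the Bipar-GCN and synergy embeddings."""
    return [b + r for b, r in zip(bipar_emb, synergy_emb)]


# Toy herb with two neighbors on the herb-herb graph.
V_h = [[0.1, 0.0], [0.0, 0.1]]
hh_neighbors = [[1.0, 2.0], [3.0, -2.0]]
r_h = sum_aggregate(hh_neighbors, V_h)

b_h = [0.5, -0.5]            # stand-in for the Bipar-GCN output
e_h_star = fuse(b_h, r_h)
```

Unlike the averaging used on the symptom-herb graph, the sum aggregator lets the magnitude of the synergy embedding grow with the number of neighbors, which is the balancing effect the text refers to for the sparser synergy graphs.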

To sum up, the above procedures clarify the proposed Multi-Graph Embedding Layer. It is a general architecture that can be used in other scenarios to model interactions between two types of objects. For example, in the recommendation scenario, Bipar-GCN can be exploited to capture the intrinsic difference between users and items. The additional user-user graph can be the social relation graph among users. The item-item graph can be item relations linked by items’ content attributes.

IV-D Syndrome Induction

As aforementioned, syndrome induction plays an essential role in TCM clinical practice. Considering the ambiguity and complexity of syndrome induction, in this work, we propose an MLP-based method to consider the implicit syndrome induction process, which can depict the nonlinear interaction among symptoms and generate an overall implicit syndrome representation.

As Fig. 4 shows, we feed all symptom embeddings in a symptom set into an MLP to induce the overall implicit syndrome representation. Given a symptom set $sc$, first we represent it with a multi-hot vector. In this vector, if $sc$ contains symptom $s$, the corresponding entry is set to 1, and 0 otherwise. Second, we look up the embedding $e_s^*$ for each symptom $s$ in $sc$ and stack these vectors to build a matrix $e_{sc} \in \mathbb{R}^{|sc| \times d}$, where $d$ is the dimension of a single symptom embedding.

Fig. 4: The MLP-based method for syndrome induction.

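Third, to induce an overall representation from the stacked symptom embeddings $e_{sc}$, average pooling ($\mathrm{Mean}(\cdot)$) is utilized. Further, considering the complexity of syndrome induction, we apply a single-layer MLP to transform the mean vector, which borrows the strength of nonlinearity in MLP to learn a more expressive syndrome representation. The above computation procedure is given as follows,

$$ e_{syndrome}(sc) = \mathrm{ReLU}\big( W^{mlp} \, \mathrm{Mean}(e_{sc}) + b^{mlp} \big), $$

wherein $e_{syndrome}(sc)$ means the induced syndrome embedding for $sc$, and $W^{mlp}$ and $b^{mlp}$ are the MLP weight matrix and bias.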
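The induction step can be sketched as mean pooling followed by one dense layer. In this hedged example the ReLU activation, the dot-product scoring against herb embeddings, and the names `induce_syndrome` and `score_herbs` are assumptions chosen for illustration; the numbers are toy values.

```python
def induce_syndrome(symptom_embs, W, b):
    """Mean-pool the symptom embeddings, then apply one dense layer with ReLU."""
    count = len(symptom_embs)
    mean = [sum(col) / count for col in zip(*symptom_embs)]
    hidden = [sum(w * x for w, x in zip(row, mean)) + bias
              for row, bias in zip(W, b)]
    return [max(0.0, v) for v in hidden]


def score_herbs(syndrome, herb_matrix):
    """Score each herb by the dot product with the syndrome vector."""
    return [sum(a * c for a, c in zip(syndrome, herb)) for herb in herb_matrix]


# Toy symptom set with three 2-d symptom embeddings.
symptom_embs = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
W_mlp = [[1.0, 1.0], [-1.0, 0.0]]
b_mlp = [0.5, 0.0]
syndrome = induce_syndrome(symptom_embs, W_mlp, b_mlp)

herb_matrix = [[1.0, 1.0], [0.0, 2.0]]   # one row per herb
scores = score_herbs(syndrome, herb_matrix)
```

Because pooling happens before the nonlinearity, the result is independent of the order in which symptoms are listed, which matches the set-valued nature of the input.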

IV-E Training and Inference

In the herb recommendation scenario, given a symptom set, a herb set is generated to cure these symptoms. For each prescription, we need to evaluate the distance between the recommended herb set and the ground truth herb set, which is similar to the multi-label classification task. As Fig. 5 shows, the frequencies with which different herbs appear in prescriptions are imbalanced. Therefore, we need to resolve the label imbalance problem.

Fig. 5: Frequency distribution of the top 40 most frequent herbs.

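Here, we use the following objective function (13) to characterize the above features in the herb recommendation scenario, where $e_H$ is the learned embedding matrix for the herb collection $H$.

$$ \mathrm{Loss} = \arg\min_{\theta} \sum_{(sc, hc') \in P} \mathrm{WMSE}\big( hc', \hat{y}_{sc} \big) + \lambda \lVert \theta \rVert_2^2, \qquad \hat{y}_{sc} = g(sc, H; \theta) = e_{syndrome}(sc) \cdot e_H^{\top}. \tag{13} $$

Given the input $sc$, the ground truth herb set $hc$ is represented as a multi-hot vector $hc'$ in dimension $|H|$. $\hat{y}_{sc}$ is the output probability vector for all herbs. $\lambda$ controls the regularization strength to prevent overfitting. WMSE [Hu2019Sets2SetsLF] is the weighted mean square loss between $hc'$ and $\hat{y}_{sc}$, which is defined as follows,

$$ \mathrm{WMSE}\big( hc', \hat{y}_{sc} \big) = \sum_{i=1}^{|H|} w_i \big( hc'_i - \hat{y}_{sc,i} \big)^2. \tag{14} $$

The dimensions of $hc'$ and $\hat{y}_{sc}$ are both $|H|$. $hc'_i$ and $\hat{y}_{sc,i}$ indicate the i-th entries in the two vectors, respectively. $w_i$ is the weight for herb $i$,

$$ w_i = \frac{\max_k \mathrm{freq}(k)}{\mathrm{freq}(i)}, \tag{15} $$

wherein $\mathrm{freq}(i)$ is the frequency of herb $i$ appearing in prescriptions. The adaptive weight setting is to balance the contribution of herbs with various frequencies. As we can see, the more frequently herb $i$ appears, the lower its weight is. We adopt Adam [Kingma2014AdamAM] to optimize the prediction model and update the model parameters in a mini-batch fashion.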
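The frequency-based weighting can be illustrated with a short sketch. The exact weight formula used here (highest frequency divided by the herb's own frequency) and the unnormalized sum are assumptions that are consistent with the description above; the names `herb_weights` and `wmse` and the toy numbers are hypothetical.

```python
def herb_weights(frequencies):
    """Weight each herb by (highest frequency) / (its own frequency)."""
    top = max(frequencies)
    return [top / f for f in frequencies]


def wmse(target, predicted, weights):
    """Weighted squared error between a multi-hot target and predicted scores."""
    return sum(w * (t - p) ** 2 for w, t, p in zip(weights, target, predicted))


# Toy corpus statistics for three herbs.
frequencies = [100, 50, 10]
weights = herb_weights(frequencies)

target = [1.0, 0.0, 1.0]       # multi-hot ground truth herb set
predicted = [0.5, 0.5, 0.0]    # model scores
loss = wmse(target, predicted, weights)
```

In this toy case the rare third herb carries weight 10, so missing it dominates the loss even though the error on the other two herbs is the same size as each other.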

Some studies argue that there are patterns among different labels that can be exploited to improve the performance in multi-label classification. Zhang et al. [Zhang2006MultilabelNN] introduce a regularization term to maximize the probability margin between the labels belonging to a set and the ones not belonging to the set. However, the pair-wise margin is not reasonable in our scenario. The detailed discussion is in the experiments part.

Inference: Following the setting in [Hu2019Sets2SetsLF], we also adopt the greedy strategy to generate the recommended herb set. Specifically, we select the top $K$ herbs with the highest probabilities in $\hat{y}_{sc}$ as the recommended herb set for $sc$.
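The greedy selection amounts to sorting herbs by score and keeping the first K. A minimal sketch, with a hypothetical function name and toy scores:

```python
def recommend_top_k(scores, herbs, k):
    """Return the k herbs with the highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [herbs[i] for i in ranked[:k]]


herbs = ["ginseng", "tuckahoe", "longan aril"]
scores = [0.1, 0.9, 0.4]
recommended = recommend_top_k(scores, herbs, k=2)
```

Since the herbs are chosen independently by score, this strategy does not explicitly check compatibility among the selected herbs; any such effect has to come from the learned embeddings.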

V Experiments

In this section, we evaluate our proposed SMGCN on the benchmark TCM dataset [Yao2018ATM]. There are several important questions to answer:

RQ1: Can our proposed model outperform the state-of-the-art herb recommendation approaches?

RQ2: Can our proposed model outperform the state-of-the-art graph neural network-based recommendation approaches?

RQ3: How effective are our proposed components (Bipar-GCN, Synergy Graph Encoding (SGE), and Syndrome Induction (SI))?

RQ4: How does our model performance react to different hyper-parameter settings (e.g., hidden layer dimension, depth of the GCN layers, and regularization strength)?

RQ5: Can our proposed SMGCN provide reasonable herb recommendation?

We first introduce the TCM data set, baselines, metrics, and experimental setup. Then the experimental results are demonstrated in detail. Last, we will discuss the influence of several critical hyperparameters.

V-A Dataset

To be consistent with work [Wang2019AKG], we conduct experiments on the benchmark TCM data set [Yao2018ATM]. The TCM data set contains 98,334 raw medical prescriptions and 33,765 processed medical prescriptions (only consisting of symptoms and herbs). As Fig. 6 shows, each prescription contains several symptoms and the corresponding herbs.

Fig. 6: The prescription example.

Among the 33,765 processed medical cases, Wang et al. [Wang2019AKG] further select 26,360 prescriptions. The 26,360 medical cases are divided into 22,917 for training and 3,443 for testing. The statistics of the experimental data set are summarized in Tab. II.

Dataset   #prescriptions   #symptoms   #herbs
All       26,360           360         753
Train     22,917           360         753
Test      3,443            254         558
TABLE II: Statistics of the evaluation data sets

V-B Evaluation

Given a symptom set, our proposed model generates a herb set to relieve the symptoms. To evaluate the performance of our approach, we adopt the following three measures commonly used in recommender systems. For each prescription $\langle sc, hc \rangle$ in the test data set, they are defined by,

$$ \mathrm{Precision@}K = \frac{|\mathrm{Top}(sc, K) \cap hc|}{K}, \tag{16} $$

$$ \mathrm{Recall@}K = \frac{|\mathrm{Top}(sc, K) \cap hc|}{|hc|}, \tag{17} $$

$$ \mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}, \tag{18} $$

wherein $\mathrm{Top}(sc, K)$ is the top K herbs with the highest prediction scores given $sc$. The Precision@K score indicates the hit ratio of top-K herbs as true herbs. In the experiments, we use the precision score to decide the optimal parameters. Recall@K describes the coverage of true herbs in the top-K recommendation. NDCG@K (Normalized Discounted Cumulative Gain) accounts for the position of the hit herbs in the recommended list. If a hit herb ranks higher in the list, it gains a larger score. We truncate the ranked list at 20 for all three measures and report the average metrics for all prescriptions in the test set.
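The three measures can be computed per prescription as in the sketch below. It assumes binary relevance and a base-2 logarithmic discount for NDCG, which is a common convention but is not spelled out in the text; the function names and the toy ranking are hypothetical.

```python
import math


def precision_at_k(ranked, truth, k):
    """Fraction of the top-k recommended herbs that are true herbs."""
    return len(set(ranked[:k]) & truth) / k


def recall_at_k(ranked, truth, k):
    """Fraction of the true herbs that appear in the top-k list."""
    return len(set(ranked[:k]) & truth) / len(truth)


def ndcg_at_k(ranked, truth, k):
    """Binary-relevance NDCG with a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, herb in enumerate(ranked[:k]) if herb in truth)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(truth))))
    return dcg / ideal if ideal > 0 else 0.0


# Toy ranking for one test prescription.
ranked = ["a", "b", "c", "d"]
truth = {"a", "c", "e"}
p4 = precision_at_k(ranked, truth, 4)
r4 = recall_at_k(ranked, truth, 4)
n4 = ndcg_at_k(ranked, truth, 4)
```

Averaging these per-prescription values over the test set gives the reported p@K, r@K and ndcg@K.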

V-C Baselines

We adopt the following approaches for comparison.

Topic model

  • HC-KGETM [Wang2019AKG]: It integrates the TransE [Bordes2013TranslatingEF] embeddings obtained from a TCM knowledge graph into the topic model, to consider not only the co-occurrence information in TCM prescriptions but also comprehensive semantic relatedness of symptoms and herbs from the knowledge graph.

Graph neural network-based Models

  • GC-MC [berg2017graph]: This model leverages GCN [Kipf2016SemiSupervisedCW] to obtain the representations of users and items. To be consistent with the original work, we set one graph convolution layer in the experiment, and the hidden dimension equals the embedding size.

  • PinSage [Ying2018GraphCN]: PinSage is an industrial application of GraphSAGE [Hamilton2017InductiveRL] on item-item graph. In our setting, we apply it on the symptom-herb interaction graph. Specifically, we adopt two graph convolution layers following [Ying2018GraphCN], and the hidden dimension is the same as the embedding size.

  • NGCF [wang2019neural]: NGCF is the state-of-the-art graph-based collaborative filtering method. It explicitly constructs a bipartite user-item graph to model the high-order connectivity and obtain more expressive representations for users and items.

Our proposed models

  • HeteGCN: It is our proposed baseline which is built based on the heterogeneous graph-based GCN [zhang2019heterogeneous]. We integrate the symptom-herb graph, herb-herb graph, and symptom-symptom graph into one heterogeneous graph. For each node, there are two types of neighbors, symptom neighbors and herb neighbors. We apply the type-based attention mechanism to perform message construction. For symptom , the one-hop neighbor message is in (19) and (20), where = denotes the neighbor type set, is defined in (1), and indicates the concatenation operation. and are the attention network parameters. The information propagation is the same as (4). To notice that, symptom and herb nodes share the same network parameters. The formulas for herb nodes are similar. HeteGCN adopts the average pooling to do syndrome induction, and multi-label loss is defined similar to (13). The depth of GCN is set to 1 with hidden dimension of 128 for better performance.

  • SMGCN: The proposed approach learns multiple graphs (i.e., the symptom-herb bipartite graph, symptom-symptom graph, and herb-herb graph), and performs graph convolution on them to describe the complex relations between symptoms and herbs from TCM. In the prediction layer, we design an MLP-based method to induce the overall implicit syndrome representation for each symptom set. As a result, it is significantly different compared with existing herb recommendation methods.

V-D Parameter Settings

We implement our approach and the comparative methods using TensorFlow. For the topic model HC-KGETM, we follow the parameter settings in [Wang2019AKG]. Grid search is conducted to find the optimal learning rate, regularization coefficient, and dropout ratio. We use the Xavier initializer [Glorot2010UnderstandingTD] and the Adam optimizer [Kingma2014AdamAM] to train models with a batch size of 1024.

For graph neural network baselines, the embedding size and the latent dimension are both set to 64. For our proposed SMGCN and HeteGCN, the embedding size is fixed to 64, and the dimension of the first output layer is 128. The last layer dimension and the GCN layer depth are tuned as well. The optimal parameter settings are summarized in Tab. III. Unless otherwise specified, the following results of our SMGCN model are obtained with 2 GCN layers and a last layer dimension of 256.

Approaches   Best parameter settings
HC-KGETM     α = 0.05, β = β' = 0.01, γ = 1
GC-MC        lr = 9e-4, dropout = 0.0, λ = 1e-6
PinSage      lr = 9e-4, dropout = 0.0, λ = 1e-3
NGCF         lr = 3e-3, dropout = 0.0, λ = 1e-5
HeteGCN      lr = 3e-3, dropout = 0.0, λ = 1e-3, x_s = 5, x_h = 40
SMGCN        lr = 2e-4, dropout = 0.0, λ = 7e-3, x_s = 5, x_h = 40
TABLE III: Optimal parameters of comparative models

V-E Performance Comparison

In this part, we firstly demonstrate the overall results among different methods, with their optimal parameter settings. Next, we conduct some ablation analysis to verify the effectiveness of different model components. Then we discuss the influence of hyperparameters in detail.

V-E1 Overall Result


Tab. IV reports the overall performance. Note that the original graph neural network-based baselines do not apply Syndrome Induction (SI) or a multi-label loss function. For a fair comparison, we modify GC-MC, PinSage and NGCF by adding the SI part and employing the multi-label loss function defined in (13). We can observe that:

Approaches p@5 p@10 p@20 r@5 r@10 r@20 ndcg@5 ndcg@10 ndcg@20
HC-KGETM 0.2783 0.2197 0.1626 0.1959 0.3072 0.4523 0.3717 0.4491 0.5501
GC-MC 0.2788 0.2223 0.1647 0.1933 0.3100 0.4553 0.3765 0.4568 0.5610
PinSage 0.2841 0.2236 0.1650 0.1995 0.3135 0.4567 0.3841 0.4613 0.5647
NGCF 0.2787 0.2219 0.1634 0.1933 0.3085 0.4505 0.3790 0.4571 0.5599
HeteGCN 0.2864 0.2268 0.1676 0.2018 0.3192 0.4667 0.3837 0.4620 0.5665
SMGCN 0.2928 0.2295 0.1683 0.2076 0.3245 0.4689 0.3923 0.4687 0.5716
%Improv. over HC-KGETM 5.22% 4.44% 3.52% 5.95% 5.63% 3.67% 5.55% 4.36% 3.90%
%Improv. over PinSage 3.09% 2.61% 2.02% 4.02% 3.49% 2.68% 2.13% 1.60% 1.23%
%Improv. over HeteGCN 2.24% 1.17% 0.44% 2.87% 1.66% 0.46% 2.24% 1.45% 0.90%
TABLE IV: The overall performance comparison. The %Improv. rows give the relative improvement of SMGCN over the named baseline. HC-KGETM utilizes log-loss but without SI. HeteGCN utilizes multi-label loss but without SI. The other models are with SI and adopt multi-label loss. The second best results are underlined. p@k and r@k are short for precision@k and recall@k.
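For reference, the ranking metrics reported in Tab. IV can be computed as in the following minimal sketch. It assumes the standard binary-relevance definitions of precision@k, recall@k, and ndcg@k; the exact evaluation code is not given in this section, and the function names and herb names are hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended herbs that are in the ground truth."""
    hits = sum(1 for herb in ranked[:k] if herb in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the ground-truth herbs that appear in the top-k list."""
    hits = sum(1 for herb in ranked[:k] if herb in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, herb in enumerate(ranked[:k])
        if herb in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one prescription with three ground-truth herbs.
ranked = ["ginseng", "licorice", "ginger", "poria", "angelica"]
relevant = {"licorice", "poria", "astragalus"}

p5 = precision_at_k(ranked, relevant, 5)
r5 = recall_at_k(ranked, relevant, 5)
n5 = ndcg_at_k(ranked, relevant, 5)
```

In practice these per-prescription values would be averaged over all test prescriptions to obtain numbers comparable to those in the table.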
  • Our proposed SMGCN performs the best among the comparative approaches. Specifically, SMGCN outperforms the topic model HC-KGETM by 5.22% in p@5, 5.95% in r@5, and 5.55% in ndcg@5. Compared with the strongest baseline, HeteGCN, SMGCN improves p@5 by 2.24%, r@5 by 2.87%, and ndcg@5 by 2.24%. Compared with the second-best baseline, PinSage, SMGCN improves p@5 by 3.09%, r@5 by 4.02%, and ndcg@5 by 2.13%.

  • HC-KGETM performs the worst on most metrics (all except r@5 and r@20). There may be two reasons: 1) at the interaction-modeling stage, it only ranks the candidate herbs based on each single symptom and ignores the symptom set information; 2) at the embedding learning step, it adopts TransE [Bordes2013TranslatingEF] to capture the information in a TCM knowledge graph. Compared to this translation-based graph embedding method, graph neural networks are superior in explicitly exploiting high-order connectivity.

  • Among GC-MC, PinSage, and NGCF, NGCF performs the worst and PinSage performs the best. GC-MC performs slightly better than NGCF. Considering that GC-MC only utilizes first-order neighbors, the multiple graph convolution layers of NGCF may cause overfitting and hurt the performance. Further, PinSage, GC-MC, and NGCF use different propagation functions: PinSage concatenates the representations of the target node and the neighbor nodes, GC-MC sums these two representations, and NGCF additionally integrates the element-wise product of the target node and the neighbor node when constructing messages. It seems that the concatenation operation is more effective in capturing the rich relations in prescriptions than the element-wise product or summation operations.

  • HeteGCN outperforms PinSage, which shows that additionally integrating the herb-herb and symptom-symptom co-occurrence relations introduces more information. However, SMGCN is still superior to HeteGCN, which suggests that the Multi-Graph GCN framework can learn a more flexible and expressive model than the unified heterogeneous-graph based GCN, and that an MLP is an appropriate choice for depicting the complex syndrome induction process.
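The three message-construction styles contrasted above (concatenation, summation, and summation plus an element-wise product) can be illustrated with a simplified NumPy sketch. These are stripped-down caricatures for comparison only, not the exact layers of PinSage, GC-MC, or NGCF; the weight matrices, the ReLU activation, the mean aggregation of neighbors, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # embedding size (hypothetical)
h_target = rng.normal(size=d)            # embedding of the target node
h_neighbors = rng.normal(size=(3, d))    # embeddings of three neighbors
h_agg = h_neighbors.mean(axis=0)         # mean-aggregated neighborhood

# Hypothetical trainable weights, randomly initialized for illustration.
W_cat = rng.normal(size=(d, 2 * d))
W_sum = rng.normal(size=(d, d))
W_prod = rng.normal(size=(d, d))


def relu(x):
    return np.maximum(x, 0.0)


def concat_update(h_t, h_n):
    """Concatenation style: transform [target ; neighborhood]."""
    return relu(W_cat @ np.concatenate([h_t, h_n]))


def sum_update(h_t, h_n):
    """Summation style: transform target + neighborhood."""
    return relu(W_sum @ (h_t + h_n))


def product_update(h_t, h_n):
    """Summation plus an element-wise product interaction term."""
    return relu(W_sum @ (h_t + h_n) + W_prod @ (h_t * h_n))


out_cat = concat_update(h_target, h_agg)
out_sum = sum_update(h_target, h_agg)
out_prod = product_update(h_target, h_agg)
```

The concatenation variant keeps the target and neighborhood signals in separate input slots of the weight matrix, whereas the summation variants mix them before the transformation, which is one possible intuition for the difference observed in the table.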

V-E2 Ablation Analysis


To better understand our proposed SMGCN model, we split the whole model into three components, Bipar-GCN, SGE, and Syndrome Induction (SI), and evaluate the contribution of each to the unified herb recommender system. Note that in Bipar-GCN, we only use average pooling to perform syndrome induction for each symptom set. In the SI part, we adopt average pooling followed by an MLP transformation. Tab. V shows the results of the ablation analysis. Here the output embedding size is set to 256, and the number of graph convolution layers is set to 2. Among the submodels, Bipar-GCN and Bipar-GCN w/ SI do not contain the synergy graphs. Thus, instead of the heterogeneous graph-based HeteGCN, we adopt the simpler baseline PinSage for comparison with all the submodels. We have the following observations:

Submodels p@5 r@5 ndcg@5
PinSage 0.2841 0.1995 0.3841
Bipar-GCN 0.2859 0.2003 0.3820
Bipar-GCN w/ SGE 0.2916 0.2064 0.3900
Bipar-GCN w/ SI 0.2914 0.2060 0.3885
SMGCN 0.2928 0.2076 0.3923
TABLE V: Performance of different submodels
  • Overall, all three components of our proposed model, i.e., Bipar-GCN, SGE, and SI, prove effective, as adding each of them improves performance.

  • Comparing Bipar-GCN with Bipar-GCN w/ SI, it is observed that the choice of MLP is superior to only employing average pooling, which verifies that the nonlinear transformation in MLP helps model the complex relations among symptoms and further generate a high-quality implicit syndrome representation.

  • For both Bipar-GCN and Bipar-GCN w/ SI, integrating Synergy Graph Encoding (SGE) leads to further improvement, which shows that using multiple graphs in the embedding learning layer not only helps learn more expressive representations but also assists in predicting herbs.

  • SMGCN, the combination of Bipar-GCN, SGE, and SI, achieves the best performance, indicating that modeling the nonlinearity in the syndrome inducing process and unifying complex relations through multiple graphs is effective in the herb recommendation scenario.
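The difference between the two syndrome induction variants compared in the ablation (average pooling alone versus average pooling followed by an MLP) can be sketched as follows. This is a minimal illustration assuming a single-layer MLP with ReLU and inner-product scoring against the herb embeddings; the embedding tables, weights, and function names are hypothetical stand-ins for the GCN outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_symptoms, num_herbs, d = 6, 5, 8

# Stand-ins for learned symptom and herb embeddings (random here).
symptom_emb = rng.normal(size=(num_symptoms, d))
herb_emb = rng.normal(size=(num_herbs, d))

# Hypothetical single-layer MLP parameters.
W_mlp = rng.normal(size=(d, d)) * 0.1
b_mlp = np.zeros(d)


def induce_avg(symptom_ids):
    """Syndrome representation by average pooling only."""
    return symptom_emb[symptom_ids].mean(axis=0)


def induce_mlp(symptom_ids):
    """Average pooling followed by a nonlinear MLP transformation."""
    pooled = symptom_emb[symptom_ids].mean(axis=0)
    return np.maximum(W_mlp @ pooled + b_mlp, 0.0)


def score_herbs(syndrome):
    """One score per herb via inner product with the syndrome vector."""
    return herb_emb @ syndrome


symptom_set = [0, 2, 3]
scores_avg = score_herbs(induce_avg(symptom_set))
scores_mlp = score_herbs(induce_mlp(symptom_set))
```

Both variants are invariant to the order of symptoms in the set; only the second can model nonlinear interactions in the pooled representation.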

V-E3 Influence of Hyperparameters


In this part, we will discuss the key factors in detail.

  • Effect of Layer Numbers

To explore whether our proposed model can benefit from a larger number of embedding propagation layers, we tune the number of GCN layers on the submodel Bipar-GCN w/ SI, which is varied in {1, 2, 3}. The dimension of the last layer is set to 256. We have the following observations from Tab. VI:

  • Our proposed Bipar-GCN w/ SI is not very sensitive to the depth of propagation layers. The two-layer model performs marginally better than the one-layer model.

  • When the number of layers is further increased to three, precision and recall drop slightly compared to one layer, while ndcg stays comparable. The reason may be overfitting caused by the large propagation depth.

  • When varying the depth of propagation layers, our Bipar-GCN w/ SI consistently outperforms the strongest baseline HeteGCN. It again verifies the effectiveness of the SI part, empirically showing that the nonlinearity of MLP can help depict the complex syndrome induction process.

depth p@5 p@20 r@5 r@20 ndcg@5 ndcg@20
1 0.2898 0.1688 0.2044 0.4702 0.3864 0.5684
2 0.2914 0.1690 0.2060 0.4695 0.3885 0.5699
3 0.2882 0.1684 0.2030 0.4684 0.3869 0.5693
TABLE VI: Effect of layer numbers on Bipar-GCN w/ SI
  • Effect of Final Embedding Dimension

The dimension of the embedding layer can influence the performance considerably. We conduct the experiments on our proposed SMGCN approach, with the depth of embedding propagation set to 2. Tab. VII shows the experimental results for various dimensions of the last output layer. As the output dimension increases, performance improves consistently until the dimension reaches 256. When the dimension rises to 512, the performance drops slightly on most metrics but is still superior to the second strongest baseline PinSage. However, when the dimension drops to 64, our model slightly underperforms PinSage in terms of r@20 and ndcg@20. This observation suggests that our proposed model depends on a reasonably large embedding dimension to have sufficient flexibility for constructing useful embeddings.

dimension p@5 p@20 r@5 r@20 ndcg@5 ndcg@20
64 0.2857 0.1651 0.1999 0.4554 0.3847 0.5627
128 0.2882 0.1670 0.2018 0.4631 0.3853 0.5660
256 0.2928 0.1683 0.2076 0.4689 0.3923 0.5716
512 0.2922 0.1673 0.2068 0.4632 0.3930 0.5700
TABLE VII: Effect of last layer dimensions on SMGCN
  • Effect of Frequency Thresholds in Synergy Graphs

The Synergy Graph Encoding (SGE) component, which contains the symptom-symptom graph and the herb-herb graph, contributes considerably to our proposed SMGCN. These two graphs reflect the co-occurrence patterns between herb-herb pairs and symptom-symptom pairs, which play an important role in TCM theory. Two hyperparameters control the construction of the synergy graphs: one threshold for herb-herb co-occurrence and one for symptom-symptom co-occurrence. For instance, if a symptom-symptom pair co-occurs in prescriptions more times than the symptom-symptom threshold, then an edge between the two symptoms is added to the symptom-symptom graph. We fix the symptom-symptom threshold to 5 and tune the herb-herb threshold over a range of values. Fig. 7 shows the experimental results for different thresholds.

[precision@5]  [recall@5]  [ndcg@5]

Fig. 7: Performance for different thresholds on SMGCN.

We report the metrics at k = 5; slightly better performance is achieved at a herb-herb threshold of 40, the setting listed in Tab. III. When the threshold is low, the herb-herb graph is relatively dense, but it may contain some noise. As the threshold increases, the graph becomes sparse, and some useful information may be filtered out. Therefore, choosing an appropriate threshold matters for the construction of synergy graphs.
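The thresholded construction of a synergy graph described above can be sketched in a few lines of Python. This is a minimal illustration under the stated rule (an edge is added when a pair co-occurs in more prescriptions than the threshold); the function name and the toy herb sets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_synergy_graph(item_sets, threshold):
    """Return the set of undirected edges (a, b), a < b, whose two items
    co-occur in strictly more than `threshold` of the given sets."""
    counts = Counter()
    for items in item_sets:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair for pair, count in counts.items() if count > threshold}


# Toy herb sets taken from four hypothetical prescriptions.
herb_sets = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "b", "d"],
    ["c", "d"],
]

dense_graph = build_synergy_graph(herb_sets, threshold=0)   # 6 edges
sparse_graph = build_synergy_graph(herb_sets, threshold=1)  # {("a", "b")}
```

With the low threshold every co-occurring pair becomes an edge, including pairs seen only once; raising the threshold keeps only the frequently co-occurring pair, which mirrors the dense-versus-sparse trade-off discussed in the text.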

  • Effect of Regularization

Due to their strong expressiveness, neural networks easily overfit the training data. Typical approaches to prevent overfitting include adding a regularization term and applying dropout to neurons. In our setting, the regularization coefficient controls the regularization strength on the parameters, and the dropout ratio controls the fraction of neurons removed during training. Fig. 8 demonstrates the influence of the regularization coefficient and Fig. 9 depicts the influence of the dropout ratio, where the dimension is set to 256 and the depth is set to 2. From Fig. 8, we observe that our model achieves slightly better performance when the regularization coefficient equals 7e-3. A larger coefficient might result in under-fitting and hurt the performance, while a smaller one might be too weak to prevent overfitting during training.

[precision@5]  [recall@5]  [ndcg@5]

Fig. 8: Performance for different regularization coefficients on SMGCN.

As for the dropout technique, instead of dropping out some nodes completely with a certain probability, we only employ message dropout on the aggregated neighborhood embeddings, making our model more robust against the presence or absence of single edges. It can be observed that the performance drops as the dropout ratio increases, which indicates that the above regularization term is sufficient to prevent overfitting.

[precision@5]  [recall@5]  [ndcg@5]

Fig. 9: Performance for different dropout ratios on SMGCN.
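As a rough illustration of message dropout, the sketch below randomly zeroes entries of an aggregated neighborhood embedding during training while leaving the node itself in the graph. It assumes element-wise inverted dropout (kept entries are rescaled by the keep probability), which is a common convention but is not confirmed by the text; the function name and signature are hypothetical.

```python
import numpy as np


def message_dropout(aggregated, rate, rng, training=True):
    """Apply element-wise inverted dropout to an aggregated message.

    aggregated: array holding the aggregated neighborhood embedding(s)
    rate:       probability of dropping each entry, in [0, 1)
    rng:        a numpy.random.Generator used to draw the mask
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return aggregated
    keep_prob = 1.0 - rate
    mask = rng.random(aggregated.shape) < keep_prob
    # Rescale kept entries so the expected value is unchanged.
    return aggregated * mask / keep_prob


rng = np.random.default_rng(0)
message = np.ones(1000)
dropped = message_dropout(message, rate=0.5, rng=rng)
unchanged = message_dropout(message, rate=0.5, rng=rng, training=False)
```

Because only the message is perturbed, the target node always keeps its own embedding, in contrast to node dropout, which removes a node and all of its outgoing messages.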
  • Effect of Loss Function

In Tab. IV, we align the GNN-based baselines (i.e., GC-MC, PinSage, and NGCF) with our proposed SMGCN by adding the SI component and employing the multi-label loss on them. Therefore, that performance comparison only verifies the effectiveness of the embedding learning layer in our model. We are also interested in the effectiveness of different combinations of embedding layer and prediction layer. We select NGCF from the baselines as a representative method, and the comparative loss function is the commonly used pair-wise BPR [Rendle2009BPRBP]. The experimental results are summarized in Tab. VIII. With the BPR loss, Bipar-GCN w/ SI performs better. With the multi-label loss, Bipar-GCN w/ SI is superior in all the metrics. This verifies that separately learning symptom and herb representations can help obtain more expressive embeddings. Besides, the multi-label loss also outperforms the BPR loss, which indicates that the multi-label loss is more appropriate for the herb recommendation task. The reasons are as follows. In a TCM prescription, the herb set is generated according to herb compatibility rules, which heavily depend on the individual experience of TCM doctors. For the same symptom set, there may be multiple herb sets that serve as remedies. Therefore, when herb A occurs in a prescription, it does not mean that A is more appropriate than every missing herb B. It may only indicate that herb B does not fit the current herb set due to some herb compatibility rules. Different from BPR, the multi-label loss computes the distance between the recommended herb set and the ground-truth herb set, which evaluates the results from the set view. This is also the reason we do not add the positive-negative label margin constraint [Zhang2006MultilabelNN] to our loss function in (13).

Loss Approaches p@5 p@20 r@5 r@20 ndcg@5 ndcg@20
BPR NGCF 0.2760 0.1606 0.1953 0.4472 0.3825 0.5624
BPR Bipar-GCN w/ SI 0.2774 0.1623 0.1951 0.4479 0.3762 0.5565
Multi-label NGCF 0.2787 0.1634 0.1933 0.4505 0.3790 0.5599
Multi-label Bipar-GCN w/ SI 0.2914 0.1690 0.2060 0.4695 0.3885 0.5699
TABLE VIII: Comparison of different loss functions
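The contrast between a pair-wise and a set-level objective can be made concrete with a small sketch. The BPR term below is the usual negative log-sigmoid of the score difference; the multi-label term is a plain mean squared error against a multi-hot target, which is only an assumed stand-in, since the loss in (13) is not reproduced in this section and may include additional weighting. All names and numbers are hypothetical.

```python
import numpy as np


def bpr_loss(scores, pos_idx, neg_idx):
    """Pair-wise BPR: each positive herb should outscore its paired negative."""
    diff = scores[pos_idx] - scores[neg_idx]
    # -log(sigmoid(diff)) written as log(1 + exp(-diff)).
    return float(np.mean(np.log1p(np.exp(-diff))))


def multilabel_loss(scores, target):
    """Set-level loss: squared distance to the multi-hot ground-truth vector."""
    return float(np.mean((target - scores) ** 2))


# Predicted scores for four herbs; herbs 0 and 2 are in the prescription.
scores = np.array([0.9, 0.2, 0.7, 0.1])
target = np.array([1.0, 0.0, 1.0, 0.0])

pairwise = bpr_loss(scores, pos_idx=[0, 2], neg_idx=[1, 3])
setwise = multilabel_loss(scores, target)
```

The pair-wise term only constrains the relative order of sampled positive and negative herbs, while the set-level term compares the whole predicted vector with the prescribed herb set, which matches the set-view argument made in the text.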

V-E4 Case Study


In this part, we conduct a case study to verify the rationality of our proposed herb recommendation approach. Fig. 10 shows two real examples in the herb recommendation scenario. Given the symptom set, our proposed SMGCN generates a herb set to cure the syndrome with the listed symptoms. In the Herb Set column, the bold red font indicates the herbs shared by the herb set recommended by SMGCN and the ground truth. According to herbal knowledge, the missing herbs have functions similar to those of the remaining ground-truth herbs and can serve as alternatives in clinical practice. This comparative analysis suggests that our proposed SMGCN is able to provide reasonable herb recommendations.

Fig. 10: The herb recommendation cases.

VI Related Work

VI-A Herb Recommendation

Prescriptions play a vital role in passing on TCM clinical experience and practice. The development of TCM prescription mining comprises three stages: 1) traditional frequency statistics and data mining techniques, mainly including association analysis, simple clustering, and classification methods; 2) topic models. Existing studies [ma2016discovering, Fan2016TCM, Wang2016ACP, Chen2018HeterogeneousIN, Ruan2017THClusterHS, Yao2018ATM, ji2017latent, Wang2019AKG] compute the conditional probability of co-occurring symptom and herb words to capture the relations among symptoms and herbs; and 3) graph model-based methods. Studies [li2018exploration, li2017distributed, Ruan2019DiscoveringRF, ruan2019exploring] organize TCM prescriptions into graphs to capture the complex regularities. Because the methods in the first category are only suitable for a single disease, we mainly focus on the second and third categories.

Topic Model Based Herb Recommendation. Topic models are applied to process prescriptions as natural language, where TCM prescriptions are documents containing herbs and symptoms as words. The underlying motivation is that herbs and symptoms occurring under the same topic are similar. Ma et al. [ma2016discovering] propose a “symptom-syndrome” model to mine the correlation between symptoms and latent syndrome topics. Ji et al. [ji2017latent] consider “pathogenesis” as the latent topics to connect symptoms and herbs. Lin et al. [Fan2016TCM] jointly model symptoms, herbs, diagnosis, and treatment in prescriptions through topic models. Wang et al. [Wang2016ACP] design an asymmetric probability generation model to model symptoms, herbs, and diseases simultaneously. Yao et al. [Yao2018ATM] integrate TCM concepts such as “syndrome”, “treatment”, and “herb roles” into topic modeling to better characterize the generative process of prescriptions. Chen et al. [Chen2018HeterogeneousIN] and Wang et al. [Wang2019AKG] introduce TCM domain knowledge into topic models to capture the herb compatibility regularities.

Unfortunately, standard topic models are not very friendly to short texts. Thus, the sparsity of prescriptions [Ruan2019DiscoveringRF] will limit the performance of topic models on large-scale prescriptions to some extent. Besides, they cannot analyze the complex interrelationships among various entities comprehensively.

Graph Based Herb Recommendation. A graph is an effective tool for modeling complex relational data. Graph representation learning-based herb recommendation is an active research topic, which mainly focuses on obtaining low-dimensional representations of TCM entities and then recommending herbs based on the embeddings. Some studies have introduced deep learning techniques into graph-based prescription mining. Li et al. [li2018exploration] utilize the attentional Seq2Seq [Zhang2019Seq2SeqAS] to design a multi-label classification method that automatically generates prescriptions. Li et al. [li2017distributed] adopt the BRNN [Schuster1997BidirectionalRN] to learn text representations of the herb words in the TCM literature for the treatment complement task. The works [Ruan2019DiscoveringRF, ruan2019exploring] integrate the autoencoder model with meta-paths to mine the TCM heterogeneous information network.

The weak point of the above graph-based models is that the applied deep learning techniques were originally designed for Euclidean data and lack interpretability and reasoning ability for non-Euclidean graph data.

VI-B Graph Neural Network-based Recommender Systems

Graph neural networks (GNNs) extend neural networks to graph data and can handle both the node features and the edge structure of graphs simultaneously. Due to their convincing performance and high interpretability, GNNs have been widely applied in recommender systems recently. GNNs are applied to different kinds of graphs as follows: 1) User-item Interaction Graphs: Berg et al. [berg2017graph] present a graph convolutional matrix completion model based on the auto-encoder framework. Wang et al. [wang2019neural] encode the collaborative signal in the embedding process based on GNN, which can capture the collaborative filtering effect sufficiently; 2) Knowledge Graphs: Wang et al. [Wang2018RippleNetPU] propose the Ripple Network, which iteratively extends a user’s potential interests along edges in a knowledge graph to stimulate the propagation of user preferences. Wang et al. [wang2019kgat] propose the Knowledge Graph Attention Network, which recursively propagates the embeddings from a node’s neighbors to obtain the node embedding, and adopts the attention mechanism to discriminate the importance of the neighbors; 3) User Social Networks: Wu et al. [wu2018socialgcn] and Fan et al. [fan2019graph] apply GCNs to capture how users’ preferences are influenced by the social diffusion process in social networks; 4) User Sequential Behavior Graphs: Wu et al. [wu2019session] and Wang et al. [wang2020gf] apply GNNs to session-based recommendation by capturing complex transition relations between items in user behavior sequences.

VII Conclusion and Future Work

In this paper, we investigate the herb recommendation task from the novel perspective of taking implicit syndrome induction into consideration. We develop a series of GCNs to simultaneously learn the symptom embedding and herb embedding from the symptom-herb, symptom-symptom, and herb-herb graphs. To learn the overall implicit syndrome embedding, we feed multiple symptom embeddings into an MLP, which is later integrated with the herb embeddings to generate herb recommendation. The extensive experiments carried out on a public TCM dataset demonstrate the superiority of the proposed model, validating the effectiveness of mimicking the syndrome induction by experienced doctors.

In future work, for embedding learning, we will improve the embedding quality of the TCM entities by adopting advanced techniques such as the attention mechanism. For graph construction, we will introduce more TCM domain-specific knowledge, including dosage and contraindications of herbs into the TCM graphs.