Distributed Representation for Traditional Chinese Medicine Herb via Deep Learning Models

11/06/2017 ∙ by Wei Li, et al. ∙ Peking University

Traditional Chinese Medicine (TCM) has accumulated a large body of valuable resources over its long history. TCM prescriptions, which consist of TCM herbs, are an important form of TCM treatment. They resemble natural language documents, but are only weakly ordered. Directly adapting language-modeling-style methods to learn herb embeddings can be problematic because the herbs are not strictly ordered: herbs at the front of a prescription can be connected to the very last ones. In this paper, we propose to represent TCM herbs with distributed representations via Prescription Level Language Modeling (PLLM). In one of our experiments, the correlation between the medicine similarity computed by our model and the judgment of professionals reaches a Spearman score of 55.35, indicating a strong correlation, which surpasses human beginners (bachelor students in a TCM-related field) by a large margin (over 10




1 Introduction

Traditional Chinese Medicine (TCM) has accumulated a large amount of data over its long development, much of which takes the form of TCM prescriptions. TCM herbs, also known as materia medica, are one of the most important means of TCM treatment. They are administered through prescriptions that a doctor writes based on his or her observation and judgment of the patient's condition.

The prescriptions consist of various kinds and doses of herbs. We show an example of a famous TCM prescription called Xiao Chai Hu Tang (小柴胡汤) in Table 1. Doctors adjust the doses of the herbs according to the specific condition of the patient. The herbs have their own natures, for instance "warm (温)", "cool (凉)", "cold (寒)" and "hot (热)". Apart from this, the compatibility of medicines also plays a very important role. For example, certain combinations are strictly prohibited by a piece of TCM guidance called the "eighteen pairs of strictly prohibited medicine combinations (十八反)". This indicates that modeling the matching patterns behind the herbs in a prescription is necessary if we want to bring Artificial Intelligence into the TCM treatment procedure.

With the development of data-driven machine learning algorithms such as deep learning, significant improvements have been achieved in the natural language processing (NLP) field, for instance in neural machine translation (Bahdanau et al., 2014; Sutskever et al., 2014; Sun et al., 2017), text summarization (Ma and Sun, 2017; Ma et al., 2017), question answering (Rajpurkar et al., 2016; Wang and Jiang, 2016), automatic dialogue generation (Li et al., 2016) and so on. How to apply deep learning to the TCM field, which seems related to NLP, then becomes an interesting question.

Name: Xiao Chai Hu Tang (小柴胡汤)
Composition: 柴胡、半夏、人参、甘草、黄芩、生姜、大枣
Translation: radix bupleuri, Pinellia ternata, ginseng, licorice root, Scutellaria baicalensis, ginger, Chinese date

Table 1: An example of TCM prescription

Previous works have attempted to use topic models such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) to describe the properties of the herbs (Zhang, 2011; Zhipeng et al., 2017). In the NLP field, neural network based word embedding models (Mikolov et al., 2013b; Pennington et al., 2014) have achieved great success and laid a good foundation for the development and application of deep learning models (Collobert et al., 2011). In this paper, we propose to learn distributed representations of TCM herbs in a way analogous to word embeddings in NLP, which we hope will be helpful to the further development of TCM research.

However, TCM prescriptions are not exactly like natural language sentences. TCM prescriptions have their own way of organizing the herbs, which are often arranged in a weakly ordered way. A herb at the front of the prescription may be connected with the very last herb instead of the surrounding ones. In our learning process, we treat each prescription as a sequence of tokens in which the herbs form the context for each other. By predicting the central herb from the corresponding context, we can learn a representation of each herb that captures the patterns of combination and thus reflects some of the properties of the herbs. In our experiments we see that first modeling the prescription as a whole gives much better results than traditional language-modeling-style methods.

Although there have been thousands of TCM prescriptions in history, the lack of digitization means that few structured digital resources exist. In this paper, we collect large-scale digital resources from the Internet. After several steps of formalization and cleaning, we obtain over 80,000 TCM prescriptions. By predicting a randomly chosen central herb from the corresponding context, we can learn a representation of each herb that captures the patterns of combination. We propose a Prescription Level Language Modeling (PLLM) method that predicts the central herb by first modeling the whole prescription. In our experiments we observe that PLLM performs much better than traditional language-modeling-style methods. Apart from this, we also propose one possible way of applying deep learning to assist doctors in real-life TCM treatment.

Our contributions mainly lie in the following aspects:

  • We clean and formalize large-scale TCM data from the Internet and provide a dataset for training and testing the quality of herb embeddings.

  • We propose to represent TCM herbs with distributed embeddings, and propose a Prescription Level Language Modeling (PLLM) method to learn the distributed representations of the TCM herbs. In the experiments we see that modeling the prescription as a whole is better than directly applying language modeling methods.

  • We propose a possible way to assist TCM doctors to compose prescriptions with deep learning methods.

2 Related Work

2.1 Computational TCM Methods

Zhou et al. (2010) attempted to build a TCM clinical data warehouse to make use of TCM knowledge. This is a typical way of collecting big data, since the number of prescriptions given by doctors in clinics is very large. In reality, however, besides the problem of quality, most TCM doctors do not use these digital systems. Therefore, we choose prescriptions from the traditional classics of TCM. Although this reduces the amount of data, it guarantees the quality of the prescriptions.

Wang et al. (2004) attempted to construct a self-learning expert system with several simple classifiers to facilitate the TCM diagnosis procedure. Wang (2013) proposed to use deep learning and CRF-based multi-label learning methods to model the TCM inquiry process. However, these systems are too simple to be effective in real-life TCM diagnosis. Lukman et al. (2007) surveyed computational methods for TCM, but the methods they cover rely on traditional data mining.

2.2 Distributed Word Embedding

Bengio et al. (2003) first proposed to learn distributed representations of words while predicting the next word in a sequence, in order to fight the curse of dimensionality. Mikolov et al. (2010) followed this thread by extending the simple feed-forward neural network to recurrent neural networks, hoping to capture longer-distance dependencies. These two models still largely resemble the framework of probabilistic language modeling. Mikolov et al. (2013b, a) proposed two very simple yet effective models called continuous bag of words (CBOW) and skip-gram. CBOW predicts the central word in a context window based on the other words in the window with a simple logistic regression classifier. Skip-gram uses the same architecture but predicts the context words based on the central word. Although these two models achieved very good results on many kinds of tasks, they cannot make use of global information. To tackle this problem, Pennington et al. (2014) proposed the Global Vectors model (GloVe), which aims to combine the advantages of both the LSA model and the CBOW model. Inspired by the above ideas, we develop our methods to learn distributed representations of herbs, but model the prescription as a whole rather than using a limited context window.

3 Data Construction

When constructing our TCM prescription dataset, we first considered historical TCM medical records (中医医案), which contain many valuable resources. Doctors refer to these medical records widely during treatment. However, the records have not been well digitized, which makes it hard to extract the prescriptions from their descriptive natural language. Another way to obtain prescriptions at scale is from TCM clinics, but most of this valuable data is not publicly available. Therefore, we turn to Internet resources, which contain large-scale digitized prescription resources.

We crawl the data from the TCM Prescription Knowledge Base (中医方剂知识库, http://www.hhjfsl.com/fang/). This knowledge base includes quite comprehensive historical TCM documentation and also provides a search engine for prescriptions. The database covers 710 historical TCM books or documents as well as some modern ones, with 85,166 prescriptions in total. Each item in the database provides the name, origin, composition, effect, prescription, contraindications and preparation methods. We clean and formalize the database and obtain 85,161 usable prescriptions. (The data and processing code are all available online.)

In the process of normalization, we temporarily omit the dose information and the description of the preparation method, which we may use in the future. Word segmentation, which pre-processes the text into word-based sequences, is typically the first step in Chinese text processing (Xu and Sun, 2016; Zhao et al., 2010; Sun et al., 2014, 2012, 2009). In addition to traditional word segmentation techniques, we use further heuristics to assist the segmentation process because this domain has specific features. We also write some simple rules to project rarely seen herbs onto the similar form that is normally used. For example, if a herb appears fewer than 5 times and its name is a substring of the name of another, more frequent herb, then the herb is mapped to the more frequent one. This simple projection procedure partly alleviates the data sparsity problem.
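The projection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' processing code: the function names `build_projection` and `normalize` are hypothetical, and the tie-breaking choice (mapping to the most frequent matching herb) is an assumption, since the text does not say how several candidates are resolved.

```python
from collections import Counter

def build_projection(prescriptions, min_count=5):
    """Map each rare herb name to a more frequent herb whose name contains it."""
    counts = Counter(h for p in prescriptions for h in p)
    frequent = [h for h, c in counts.items() if c >= min_count]
    mapping = {}
    for herb, c in counts.items():
        if c >= min_count:
            continue  # frequent herbs are kept unchanged
        # Candidates: frequent herbs whose name contains the rare name.
        candidates = [f for f in frequent if herb in f and f != herb]
        if candidates:
            # Assumption: ties are resolved by taking the most frequent candidate.
            mapping[herb] = max(candidates, key=lambda f: counts[f])
    return mapping

def normalize(prescription, mapping):
    """Replace rare herb names by their projected form."""
    return [mapping.get(h, h) for h in prescription]

# Toy data: "地黄" is rare and is a substring of the frequent "生地黄".
toy = [["生地黄", "当归"]] * 5 + [["地黄", "当归"]]
mapping = build_projection(toy)
cleaned = normalize(["地黄", "当归"], mapping)
```

Herbs that have no frequent superstring are left as they are, so the rule only merges names when a clear target exists.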

3.1 HerbSim80

Similar to the way wordsim353 (Finkelstein et al., 2001) was built, we manually build a dataset consisting of 80 pairs of herbs, and ask three TCM professionals to judge how likely the two herbs in each pair are to appear in the same prescription. We then evaluate the embeddings by calculating the correlation between the similarity scores given by the cosine similarity of the embeddings and the scores given by the professionals. Table 2 shows some examples.

Herb 1 Herb 2 S 1 S 2 S 3 Ave S

栀子 1 1 1 1.00
麦门冬 山茱萸 3 3 3 3.00
赤芍 藿香 2 2 1 1.67
苍术 乌梅肉 2 1 1 1.33

Table 2: Examples of HerbSim80. The Chinese characters in the Herb 1 and Herb 2 columns are herb names. S 1, S 2 and S 3 are the scores given by the three professionals, and Ave S is the average of the three scores. Footnote 4: 乌头

The detailed procedure is as follows:

  1. We randomly generate 120 pairs of TCM herbs.

  2. We invite three TCM professionals, who have been practicing TCM diagnosis and treatment for over five years, to give a score of the herb pair between 1 and 5. 1 indicates that the two herbs are very unlikely to appear together in one prescription. On the contrary, 5 indicates that the two herbs often appear as a pair in the same prescription.

  3. We rank the pairs by the standard deviation of the three scores given by the professionals, and keep the 80 pairs with the lowest standard deviation, that is, the best agreement. The final score is the average of the three scores.

  4. We invite a junior student who majors in TCM (student who has just finished the course of Principles of TCM Prescriptions) to do the task, which will be compared with the result given by the embeddings.
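Step 3 of the procedure can be sketched as below. This is only an illustration under stated assumptions: the function name `select_pairs` is hypothetical, the population standard deviation is used because the text does not say which variant was applied, and the pair `herb_x`/`herb_y` is made up to show a high-disagreement case (the other three rows come from Table 2).

```python
from statistics import mean, pstdev

def select_pairs(scored_pairs, keep=80):
    """Keep the `keep` pairs whose three scores agree best (lowest std)."""
    # Each item is (herb1, herb2, [s1, s2, s3]); sort by disagreement.
    ranked = sorted(scored_pairs, key=lambda p: pstdev(p[2]))
    # The final score of a pair is the average of the three scores.
    return [(h1, h2, round(mean(s), 2)) for h1, h2, s in ranked[:keep]]

scored = [
    ("herb_x", "herb_y", [1, 3, 5]),   # hypothetical pair with poor agreement
    ("麦门冬", "山茱萸", [3, 3, 3]),
    ("赤芍", "藿香", [2, 2, 1]),
    ("苍术", "乌梅肉", [2, 1, 1]),
]
selected = select_pairs(scored, keep=3)
```

With 120 candidate pairs and `keep=80`, the same call would drop the 40 pairs on which the professionals disagree most.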

4 Distributed Representations of TCM Herb

Similar to the one-hot representation in NLP, herbs can be represented as one-hot vectors, where the length of the vector is the size of the whole herb vocabulary and, in each vector, only one slot is filled with "1" while all others are "0". The problem with this representation is that it cannot show the innate relations between herbs, which are even more important here than in NLP. For example, cinnamon (肉桂) and cinnamon twig (桂枝) are two different parts of the same plant, the cinnamon tree. The natures of these two herbs are very similar, but in the one-hot representation the distance between them is no different from that of any other pair.

Another possible way of representing herbs is to model them with features that reflect how TCM experts view the herbs. Each aspect of a herb can be represented with a one-hot vector. This representation accords with traditional TCM theory. For example, if we model each herb from two aspects, cold and hot (寒热) and the five flavors (五味: sweet, sour, bitter, pungent and salty), we can represent a herb with a vector of length 7. However, this requires very expensive expert effort, which makes it impracticable.

Inspired by the way words are represented with distributed vectors in the NLP field, we propose to represent TCM herbs with distributed representations. We treat TCM prescriptions as documents and herbs as words. We can automatically learn the herb embeddings by tuning the distributed representations of the herbs while predicting the central herb from the context herbs in the prescription. In this way the information about a herb is implicitly embodied in its vector, and we can learn the representations automatically from the dataset we build without much human effort.

Although TCM prescriptions are very similar to natural language texts, there is one major difference. In natural language, the order of the words is very important and is strictly constrained by syntax and grammar, while in prescriptions the order of the herbs usually plays a less important role. Moreover, a herb at the front of a prescription may be connected to the very last one instead of its surrounding ones. Based on this observation, we propose to first model the prescription as a whole, and then predict the central herb.

4.1 Proposed Baselines

In this subsection, we describe several baseline models that are directly adapted from the NLP field.

  • Latent Semantic Analysis (LSA) : LSA is used to analyze the relationship between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. We model the herbs as words and prescriptions as documents. We use a matrix containing herb counts in prescriptions to represent the coexistence relation between herbs and prescriptions. With singular value decomposition, we can get a vector for each herb, which is then used as the distributed representation of the herbs.

  • Continuous bag of words (CBOW) (Mikolov et al., 2013b) : This model uses a log-linear regression model with negative sampling to predict a central word with the context words within a local context window.

  • Global vectors for word representation (GloVe)(Pennington et al., 2014) : This model uses a global log-bilinear regression model that combines the advantages of the global matrix factorization (similar to LSA) and local context window (similar to CBOW) methods.

  • Recurrent Neural Networks Language Modeling (RNNLM) : This model is similar to the CBOW model in its objective, which is to predict a herb based on its context herbs. However, it aims to model longer dependencies between herbs by using a bidirectional gated recurrent neural network (BiGRNN) to predict the central herb. The model predicts the central herb by considering the herbs both before and after it.
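As an illustration of the LSA baseline in the list above, the following sketch builds a herb-by-prescription count matrix and factorizes it with a singular value decomposition. It is a NumPy stand-in written for this example, not the gensim pipeline used in the experiments; the function name `lsa_embeddings` and the toy herb names are hypothetical.

```python
import numpy as np

def lsa_embeddings(prescriptions, dim=2):
    """Return (vocab, matrix) where row i is the LSA vector of vocab[i]."""
    vocab = sorted({h for p in prescriptions for h in p})
    index = {h: i for i, h in enumerate(vocab)}
    # counts[i, j] = number of times herb i occurs in prescription j.
    counts = np.zeros((len(vocab), len(prescriptions)))
    for j, p in enumerate(prescriptions):
        for h in p:
            counts[index[h], j] += 1
    # Truncated SVD: keep the `dim` largest singular directions.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(dim, len(s))
    return vocab, u[:, :k] * s[:k]

toy = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
vocab, vectors = lsa_embeddings(toy, dim=2)
```

In the toy data, herbs "a" and "b" occur in exactly the same prescriptions, so they receive identical vectors, which is the kind of co-occurrence signal LSA captures.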

4.2 Prescription Level Language Modeling

In this subsection, we show the details of our proposed method, Prescription Level Language Modeling (PLLM).

Figure 1: An illustration of Prescription Level Language Modeling with BiGRNN

As shown in Figure 1, we first encode the whole prescription (except the central herb) into a prescription level vector, that is, a vector that encodes the information of the whole prescription, and then predict the central herb based on this prescription level vector.

  1. We take the one-hot herbs in the prescription as input, and project them into the corresponding embeddings.

  2. Then we go over the whole prescription, except the central herb to be predicted, with the BiGRNN, which gives us the hidden states.

  3. After that, we apply last pooling to the hidden states and get the context vector, which is expected to embody the information of the whole prescription.

  4. Finally, we use a regression layer to predict the central herb.
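The four steps above can be sketched as a forward pass in NumPy. This is a minimal illustration under several assumptions, not the authors' TensorFlow implementation: the gated unit is taken to be a standard GRU without bias terms, "last pooling" is read as concatenating the final forward and final backward hidden states, the regression layer is read as a softmax over the herb vocabulary, and all names and sizes (`pllm_predict`, `init_gru`, the dimensions) are hypothetical. Training is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(in_dim, hid_dim):
    # One weight matrix per gate, acting on [input; hidden state].
    shape = (hid_dim, in_dim + hid_dim)
    return {name: rng.normal(0, 0.1, shape) for name in ("z", "r", "h")}

def gru_last_state(params, inputs, hid_dim):
    """Run a GRU over `inputs` and return its final hidden state."""
    h = np.zeros(hid_dim)
    for x in inputs:
        xh = np.concatenate([x, h])
        z = sigmoid(params["z"] @ xh)                      # update gate
        r = sigmoid(params["r"] @ xh)                      # reset gate
        cand = np.tanh(params["h"] @ np.concatenate([x, r * h]))
        h = (1 - z) * h + z * cand
    return h

def pllm_predict(herb_ids, center_pos, emb, fwd, bwd, out_w, hid_dim):
    # Step 1: embedding lookup; the central herb is left out of the input.
    context = [emb[i] for k, i in enumerate(herb_ids) if k != center_pos]
    # Step 2: read the remaining prescription in both directions.
    h_fwd = gru_last_state(fwd, context, hid_dim)
    h_bwd = gru_last_state(bwd, context[::-1], hid_dim)
    # Step 3: last pooling gives one prescription level vector.
    prescription_vec = np.concatenate([h_fwd, h_bwd])
    # Step 4: softmax over the vocabulary scores the central herb.
    logits = out_w @ prescription_vec
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

vocab_size, emb_dim, hid_dim = 10, 8, 6
emb = rng.normal(0, 0.1, (vocab_size, emb_dim))
fwd, bwd = init_gru(emb_dim, hid_dim), init_gru(emb_dim, hid_dim)
out_w = rng.normal(0, 0.1, (vocab_size, 2 * hid_dim))
probs = pllm_predict([3, 1, 4, 5], 1, emb, fwd, bwd, out_w, hid_dim)
```

Because both directions read the full remaining prescription, the prediction for a herb near the front can depend on the very last herb, which is the property the method relies on.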

We encode the whole prescription into a fixed-length vector in order to capture dependencies beyond the local windows of the herbs. In this way, even the last herb can help predict the first herb in the prescription. We believe this is an important advantage over the baseline RNNLM model, which treats the herbs before and after the central one separately. Furthermore, the vector of the whole prescription may also be a good representation of the disease that the prescription is meant to treat, which we would like to explore in the future.

5 Application on TCM Treatment

In the TCM diagnosis and treatment procedure, unlike in modern medicine, doctors usually have more freedom when writing a prescription, which is based on their own observation instead of a standard process. Still, they often refer to the classical prescriptions recorded in the TCM classics, for instance the Treatise on Febrile Diseases (《伤寒论》). These classics contain not only the principles for writing prescriptions but also some widely used, carefully constructed prescriptions. Based on this observation, in this section we propose a language-modeling-style method that uses the model we learn from the classical prescriptions, which can hopefully give hints to doctors when they write prescriptions for patients. One thing that should be noted is that our proposed method is a prototype rather than a complete tool.

Doctors start to write prescriptions after they have made a judgment on the patient's situation. Each time a herb is given, our model processes the unfinished prescription and then suggests a candidate herb that the doctor may want to use. The CBOW, RNNLM and PLLM models are the same as described in Section 4. We also apply an N-gram model as a baseline. As in NLP, the N-gram model predicts the next herb by selecting the herb with the largest likelihood, where the likelihood is given by a linear combination of unigram, bigram and trigram transition probabilities. The parameters of all these models are trained on the dataset we build, which consists of classical prescriptions. After the model predicts the most probable herb, the doctor can choose whether or not to take the advice.

Algorithm 1 TCM prescription auxiliary composition
Input: The unfinished prescription, trained model
1:  Read in the herbs in the unfinished prescription.
2:  Choose a model out of N-gram, RNNLM, CBOW and PLLM to process the herbs.
3:  Predict a herb that fits the current unfinished prescription.
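Algorithm 1 with the N-gram baseline can be sketched as follows. This is an illustrative sketch only: the class name `InterpolatedNgram`, the interpolation weights and the toy prescriptions are assumptions made for this example, the probabilities are plain relative frequencies without smoothing, and only the left-to-right direction is modeled.

```python
from collections import Counter

class InterpolatedNgram:
    """Next-herb suggestion from unigram, bigram and trigram counts."""

    def __init__(self, prescriptions, weights=(0.2, 0.3, 0.5)):
        self.weights = weights  # hypothetical unigram/bigram/trigram weights
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        for p in prescriptions:
            for i, herb in enumerate(p):
                self.uni[herb] += 1
                if i >= 1:
                    self.bi[(p[i - 1], herb)] += 1
                if i >= 2:
                    self.tri[(p[i - 2], p[i - 1], herb)] += 1
        self.total = sum(self.uni.values())

    def score(self, history, herb):
        """Linear combination of the three relative-frequency estimates."""
        w1, w2, w3 = self.weights
        p1 = self.uni[herb] / self.total
        p2 = p3 = 0.0
        if len(history) >= 1 and self.uni[history[-1]]:
            p2 = self.bi[(history[-1], herb)] / self.uni[history[-1]]
        if len(history) >= 2 and self.bi[tuple(history[-2:])]:
            context = tuple(history[-2:])
            p3 = self.tri[context + (herb,)] / self.bi[context]
        return w1 * p1 + w2 * p2 + w3 * p3

    def suggest(self, unfinished):
        """Steps 1 to 3 of Algorithm 1: read the herbs, score, pick the best."""
        candidates = [h for h in self.uni if h not in unfinished]
        return max(candidates, key=lambda h: self.score(unfinished, h))

toy = [
    ["麻黄", "桂枝", "杏仁", "炙甘草"],
    ["麻黄", "桂枝", "生姜"],
    ["桂枝", "芍药", "生姜"],
]
model = InterpolatedNgram(toy)
suggestion = model.suggest(["麻黄"])
```

Swapping in another model only requires an object with the same `suggest` method, which matches the model choice in step 2 of the algorithm.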

6 Experiments

In all of the following experiments, we use our distributed herb representations in an unsupervised way. The similarity between two herbs is always given by the cosine similarity between the two vectors.

6.1 Correlation

In this section, we show the correlation results between the professionals and various models. We use the HerbSim80 dataset described in Section 3.1. For LSA, the vector size is set to 20, while for the other models the vector size is set to 100. The gensim toolkit (http://radimrehurek.com/gensim/) is used to train the LSA model. For GloVe (https://nlp.stanford.edu/projects/glove/) and CBOW (https://code.google.com/p/word2vec/) we use their official programs respectively. The similarity score of two herbs is given by the cosine similarity between the vectors of the herbs. We use Spearman's rank correlation as the criterion to evaluate the correlation between our model and the professionals. Our RNNLM and PLLM models are built using the TensorFlow toolkit (Abadi et al., 2015). We choose Adam (Kingma and Ba, 2014) as the optimization method. An early stopping strategy is adopted to avoid over-fitting: we stop training when the accuracy of herb prediction on the development set has failed to increase over the last three training epochs.
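The evaluation protocol described above can be sketched in plain Python. The sketch computes cosine similarity for each herb pair and then Spearman's rank correlation against the averaged human scores. The function names and the toy embeddings are hypothetical, and tied values receive average ranks, which is the usual convention but is an assumption about the original evaluation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def evaluate(embeddings, pairs):
    """pairs: list of (herb1, herb2, averaged human score)."""
    model = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
    human = [score for _, _, score in pairs]
    return spearman(model, human)

# Made-up two-dimensional vectors, only to show the call pattern.
toy_emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_pairs = [("a", "b", 3.0), ("a", "c", 1.0), ("b", "c", 1.67)]
rho = evaluate(toy_emb, toy_pairs)
```

Only the ordering of the similarity scores matters for this metric, so embeddings with different similarity scales can be compared directly.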

In the Student row of Table 3, we show the correlation result of a junior student who majors in TCM. From the table we can see that the PLLM model gives the best result, which surpasses the result of the student by over 10 points (0.5535 vs. 0.4506). This shows that our PLLM model can learn useful knowledge from the prescriptions in the dataset with unsupervised learning, and that an overall description of the prescription can indeed help predict the herb. The simple CBOW model also gives a rather good result of 0.4933. The traditional LSA model does not perform well in this experiment, possibly because it omits the local information of the herbs, which plays a more important role in TCM prescriptions. GloVe suffers from a similar problem, in that the global part of its objective influences the representation of the local context.

Model Spearman’s R
Student 0.4506
LSA 0.3861
GloVe 0.3952
CBOW 0.4933
RNNLM 0.5158
PLLM 0.5535
Table 3: Correlation with the professionals' judgment. The Student row gives the correlation score of the junior student.

6.2 Prediction

In Section 5, we propose to use our model to assist doctors in writing prescriptions. We manually build a test set consisting of 206 prescriptions. For each prescription, we blank out one randomly chosen herb and test whether our models can predict the original herb. The original prescriptions all have at least four herbs. Some examples are shown in Table 4, where the herb in the Answer column fills the blank in the Question column.

Question                        Answer
麻黄 ___ 杏仁 炙甘草             桂枝
生地黄 当归 牡丹皮 ___ 升麻      黄连
Table 4: Examples from the test set for herb prediction. The characters in the Question column are TCM herb names. The characters in the Answer column are the herb name that should be put into the blank in the Question column.
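The construction of such fill-in-the-blank items and the accuracy computation can be sketched as follows. The helper names `make_cloze` and `accuracy` are hypothetical, the blank marker `___` follows Table 4, and the constant predictor at the end is a placeholder that stands in for any of the trained models.

```python
import random

def make_cloze(prescription, rng):
    """Blank out one randomly chosen herb; return (question, answer)."""
    pos = rng.randrange(len(prescription))
    question = prescription[:pos] + ["___"] + prescription[pos + 1:]
    return question, prescription[pos]

def accuracy(predict, items):
    """Percentage of items whose blank is filled with the original herb."""
    correct = sum(1 for question, answer in items if predict(question) == answer)
    return 100.0 * correct / len(items)

rng = random.Random(0)
question, answer = make_cloze(["麻黄", "桂枝", "杏仁", "炙甘草"], rng)

items = [
    (["麻黄", "___", "杏仁", "炙甘草"], "桂枝"),
    (["生地黄", "当归", "牡丹皮", "___", "升麻"], "黄连"),
]
# Placeholder predictor that always answers "桂枝".
score = accuracy(lambda q: "桂枝", items)
```

A real run would pass the `predict` function of a trained model and the 206 test items instead of the two examples used here.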

In this experiment, we use bigrams and trigrams from both directions for N-gram prediction. For the prediction score, we simply add up the probabilities with equal weights.

From Table 5 we can see that the N-gram baseline is very strong. Its accuracy is even higher than that of the RNNLM model, which shows that directly transferring language modeling methods from NLP may not be a good idea when predicting the next herb. The CBOW model used here is slightly different from the original one: we average all the herb embeddings of the prescription and predict the blanked herb based on this average. We assume that it gives a rather good result because it makes use of a wider range of context. In this experiment, our PLLM achieves the best result, which is much higher than that of the other models. We take this as evidence that it is necessary to consider the whole prescription when predicting the next herb.

Again, we clarify that this application is a prototype. It does not mean that other factors, such as the patient's situation, can be ignored when composing a prescription. What we want to show is that the combination of herbs plays an important role in composing a prescription and that our model can capture this kind of pattern to some degree.

Model Accuracy (%)
N-gram 17.96
CBOW 20.39
RNNLM 16.50
PLLM 32.04
Table 5: Accuracy on herb prediction

6.3 Further Observation

In the experiments, we observe that the distributed vectors of herbs exhibit linear algebraic relationships. For example, [熟地黄 (prepared rehmannia root)] − [生地黄 (dried rehmannia root)] ≈ [煨姜 (roasted ginger)] − [生姜 (ginger)]. This phenomenon is similar to the regularities among word vectors described in Mikolov et al. (2013a). In the future, we hope to look further into this and see whether it is a general phenomenon in TCM herbs.
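One way to check such a relationship is to compare the two offset vectors directly, as in the sketch below. The two-dimensional vectors are made up so that the relation holds exactly and are not learned embeddings; the function names `offset` and `offset_similarity` are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def offset(emb, a, b):
    """Difference vector emb[a] - emb[b]."""
    return [x - y for x, y in zip(emb[a], emb[b])]

def offset_similarity(emb, a, b, c, d):
    """Cosine between (a - b) and (c - d); near 1 when the offsets align."""
    return cosine(offset(emb, a, b), offset(emb, c, d))

# Toy vectors: the second coordinate plays the role of "processed".
toy_emb = {
    "熟地黄": [1.0, 1.0],
    "生地黄": [1.0, 0.0],
    "煨姜": [-1.0, 1.0],
    "生姜": [-1.0, 0.0],
}
sim = offset_similarity(toy_emb, "熟地黄", "生地黄", "煨姜", "生姜")
```

With learned embeddings the value would be below 1, and comparing it against the offset similarity of unrelated herb pairs would indicate whether the pattern is more than chance.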

7 Conclusion and Future Work

In this paper, we propose to represent TCM herbs with distributed representations via Prescription Level Language Modeling. Our experiments show that simply adopting methods from the NLP field is problematic because of the differences between natural language and TCM prescriptions. Furthermore, we propose a possible application of our models in TCM treatment, which we hope can help doctors compose prescriptions.