Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages

08/10/2015 ∙ by Nut Limsopatham, et al. ∙ University of Cambridge

Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, in order for a machine to understand and make inferences on these health conditions, the ability to recognise when laymen's terms refer to a particular medical concept (i.e. text normalisation) is required. To achieve this, we propose to adapt an existing phrase-based machine translation (MT) technique and a vector representation of words to map between a social media phrase and a medical concept. We evaluate our proposed approach using a collection of phrases from tweets related to adverse drug reactions. Our experimental results show that the combination of a phrase-based MT technique and the similarity between word vector representations outperforms the baselines that apply only either of them by up to 55%.


1 Introduction

Social media, such as DailyStrength (http://www.dailystrength.org/) and Twitter (http://twitter.com), is a fast-growing and potentially rich source of voice-of-the-patient data about experience of the benefits and side-effects of drugs and treatments [O’Connor et al.2014]. However, natural language understanding of social media messages is a difficult task because of the lexical and grammatical variability of the language [Baldwin et al.2013, O’Connor et al.2014]. Indeed, language understanding by machines requires the ability to recognise when a phrase refers to a particular concept. Given a variable-length phrase, an effective system should return the concept with the most similar meaning. For example, the Twitter phrase ‘No way I’m gettin any sleep 2nite’ might be mapped to the medical concept ‘Insomnia’ (SNOMED:193462001), when using the SNOMED-CT dictionary [Spackman et al.1997]. Successful mapping between social media phrases and formal medical concepts would enable an automatic integration between patient experiences and biomedical databases.

Existing work, e.g. [Elkin et al.2012, Gobbel et al.2014, Wang et al.2009], has mostly focused on extracting medical concepts from medical documents. For example, Gobbel et al. [Gobbel et al.2014] proposed a naive Bayesian technique to map phrases from clinical notes to medical concepts in the SNOMED-CT dictionary. Wang et al. [Wang et al.2009] identified medical concepts regarding adverse drug events in electronic medical records. On the other hand, O’Connor et al. [O’Connor et al.2014] investigated the normalisation of medical terms in Twitter messages. In particular, they proposed to use the Lucene retrieval engine (http://lucene.apache.org/) to retrieve medical concepts that could potentially be mapped to a given Twitter phrase.

In contrast, we argue that the medical text normalisation task [Limsopatham and Collier2015] can be achieved by using well-established phrase-based MT techniques, where we translate a text written in a social media language (e.g. ‘No way I’m gettin any sleep 2nite’) to a text written in a formal medical language (e.g. ‘Insomnia’). Indeed, in this work we investigate an effective adaptation of phrase-based MT to map a Twitter phrase to a medical concept. Moreover, we propose to combine the adapted phrase-based MT technique and the similarity between word vector representations to effectively map a Twitter phrase to a medical concept.

The main contributions of this paper are three-fold:

  1. We investigate the adaptation of phrase-based MT to map a Twitter phrase to a SNOMED-CT concept.

  2. We propose to combine our adaptation of phrase-based MT and the similarity between word vector representations to map Twitter phrases to formal medical concepts.

  3. We thoroughly evaluate the proposed approach using phrases from our collection of tweets related to the topic of adverse drug reactions (ADRs).

2 Related Work

Phrase-based MT models (e.g. [Koehn et al.2003, Och and Ney2004]) have been shown to be effective in translation between languages, as they learn local term dependencies, such as collocations, re-orderings, insertions and deletions. Koehn et al. [Koehn et al.2003] showed that a phrase-based MT technique markedly outperformed traditional word-based MT techniques on several benchmarks. In this work, we adapt the phrase-based MT technique of Koehn et al. [Koehn et al.2003] for the medical text normalisation task. In particular, we use the phrase-based MT technique to translate phrases from the Twitter language to the formal medical language, before mapping the translated phrases to medical concepts based on the ranked similarity of their word vector representations.

Traditional approaches for creating word vector representations treated words as atomic units [Mikolov et al.2013b, Turian et al.2010]. For instance, the one-hot representation used a vector whose length equals the size of the vocabulary, where exactly one dimension is on, to represent a particular word [Turian et al.2010]. Recently, techniques for learning high-quality word vector representations (i.e. distributed word representations) that can capture the semantic similarity between words, such as continuous bag-of-words (CBOW) [Mikolov et al.2013b] and global vectors (GloVe) [Pennington et al.2014], have been proposed. Indeed, these distributed word representations have been effectively applied in systems that achieve state-of-the-art performance for several NLP tasks, such as MT [Mikolov et al.2013a] and named entity recognition [Passos et al.2014]. In this work, besides using word vector representations to measure the similarity between translated Twitter phrases and medical concepts, we use the similarity between the word vector representations of the original Twitter phrase and a medical concept to augment the adapted phrase-based MT technique.

3 Medical Term Normalisation

We discuss our adaptation of phrase-based MT for medical text normalisation in Section 3.1. Section 3.2 introduces our proposed approach for combining the similarity score of word vector representations with the adapted phrase-based MT technique.

3.1 Adapting Phrase-based MT

We aim to learn a translation between a Twitter phrase (i.e. a phrase from a Twitter message) and a formal medical phrase (i.e. the description of a medical concept). For a given Twitter phrase p, we find a suitable medical phrase t using a translation score, based on a phrase-based model, as follows:

  t̂ = arg max_t P(t | p)    (1)

where P(t | p) can be calculated using any phrase-based MT technique, e.g. [Koehn et al.2003, Och and Ney2004]. We then rank translated phrases based on this translation score. The top-k translated phrases are used for identifying the corresponding medical concept.
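For illustration, the top-k translated phrases could be extracted from a decoder's n-best list as sketched below. The sketch assumes the Moses-style ‘id ||| hypothesis ||| features ||| score’ line format; the example hypotheses and scores are made up.

```python
from collections import defaultdict

def top_k_translations(nbest_lines, k=5):
    """Parse Moses-style n-best output ("id ||| hypothesis ||| features ||| score")
    and return the k highest-scoring translated phrases for each source id."""
    candidates = defaultdict(list)
    for line in nbest_lines:
        fields = [f.strip() for f in line.split("|||")]
        source_id, hypothesis, score = int(fields[0]), fields[1], float(fields[3])
        candidates[source_id].append((score, hypothesis))
    # Sort each candidate list by model score (descending) and keep the top k.
    return {sid: [hyp for _, hyp in sorted(cands, reverse=True)[:k]]
            for sid, cands in candidates.items()}

# Hypothetical 3-best list for one Twitter phrase:
nbest = [
    "0 ||| insomnia ||| lm: -2.1 tm: -1.3 ||| -1.5",
    "0 ||| unable to sleep ||| lm: -2.9 tm: -1.8 ||| -2.2",
    "0 ||| sleeplessness ||| lm: -2.4 tm: -1.6 ||| -1.9",
]
top2 = top_k_translations(nbest, k=2)[0]
```

Each of the retained translated phrases is then matched against concept descriptions as described next.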

However, the translated phrase t may not exactly match the description of any target medical concept. We propose two techniques to deal with this problem. For the first technique, we rank the target concepts based on the cosine similarity between the vector representation of t and the vector representation of the description d of each concept c:

  sim(t, c) = cos(v_t, v_d) = (v_t · v_d) / (‖v_t‖ ‖v_d‖)    (2)

where v_t and v_d are the vector representations of t and d, respectively. Any technique for creating word vector representations (e.g. one-hot, CBOW and GloVe) can be used. Note that if a phrase (e.g. t) contains several terms, we create its vector representation by summing the values of each dimension of the vector representations of its terms (i.e. element-wise addition).
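A minimal sketch of this first technique follows. The toy 2-dimensional word vectors and the concept ids are purely illustrative; in practice any of the representations discussed above (one-hot, CBOW, GloVe) would supply `word_vectors`.

```python
import math

def phrase_vector(phrase, word_vectors, dim):
    """Element-wise sum of the word vectors of the terms in a phrase;
    out-of-vocabulary terms contribute nothing."""
    vec = [0.0] * dim
    for term in phrase.lower().split():
        for i, value in enumerate(word_vectors.get(term, [])):
            vec[i] += value
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_concepts(translated_phrase, concepts, word_vectors, dim):
    """Rank (similarity, concept_id) pairs by cosine similarity (Equation (2))."""
    t_vec = phrase_vector(translated_phrase, word_vectors, dim)
    scored = [(cosine(t_vec, phrase_vector(desc, word_vectors, dim)), cid)
              for cid, desc in concepts]
    return sorted(scored, reverse=True)

# Toy vectors and concepts, for illustration only:
vectors = {"insomnia": [1.0, 0.0], "sleep": [0.9, 0.1], "headache": [0.0, 1.0]}
concepts = [("193462001", "insomnia"), ("25064002", "headache")]
best = rank_concepts("cannot sleep", concepts, vectors, dim=2)[0]
```

Here the unseen term ‘cannot’ is simply skipped, so the phrase vector is dominated by ‘sleep’, which is closest to the ‘insomnia’ description.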

On the other hand, the second technique also incorporates the ranked position rank(t) of the translated phrase t when translated from the original phrase using Equation (1). Indeed, the second technique calculates the similarity score as follows:

  sim_rank(t, c) = cos(v_t, v_d) · (1 / rank(t))    (3)

3.2 Combining Similarity Score with Phrase-based MT

As discussed in Section 2, word vector representations (e.g. created by CBOW or GloVe) can capture the semantic similarity between words by themselves. Hence, we propose to map a Twitter phrase p to a medical concept c, which is represented by a description d, by linearly combining the cosine similarity between the vector representations of the Twitter phrase p and the description d with the similarity score computed using one of the adapted phrase-based MT techniques (introduced in Section 3.1), as follows:

  score(p, c) = λ · cos(v_p, v_d) + (1 − λ) · sim_MT(p, c)    (4)

where sim_MT(p, c) is calculated using one of the adapted phrase-based MT techniques described in Section 3.1.
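The two scoring refinements of Sections 3.1 and 3.2 reduce to a few lines. In this sketch, the 1/rank discount for the rank-aware score and the interpolation weight `lam = 0.5` are assumptions for illustration, not values specified in the paper.

```python
def rank_discounted_sim(cos_sim, rank):
    # Rank-aware score (Equation (3)): discount the cosine similarity by the
    # position at which the translated phrase appeared in the top-k list.
    # A 1/rank discount is assumed here.
    return cos_sim / rank

def combined_score(cos_p_c, mt_score, lam=0.5):
    # Combined score (Equation (4)): linear combination of the cosine similarity
    # between the original Twitter phrase and the concept description, and the
    # adapted phrase-based MT score. lam = 0.5 is an illustrative weight.
    return lam * cos_p_c + (1.0 - lam) * mt_score
```

For example, a translation ranked second has its cosine similarity halved, while the combined score averages the two evidence sources when `lam = 0.5`.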

4 Experimental Setup

4.1 Test Collection

To evaluate our approach, we use a collection of 25 million tweets related to adverse drug reactions (ADRs). In particular, these tweets are related to cognitive enhancers [Hanson et al.2013] and anti-depressants [Schneeweiss et al.2010] that can have adverse side effects. We use 201 ADR phrases and their corresponding SNOMED-CT concepts, annotated by a PhD-level computational linguist. These phrases were anonymised by replacing numbers, user IDs, URIs, locations, email addresses, dates and drug names with appropriate tokens, e.g. _NUMBER_.

4.2 Evaluation Approach

We conduct experiments using 10-fold cross validation, where the Twitter phrases are randomly divided into 10 separate folds. We address this task as a ranking task, where we aim to rank the medical concept with the highest similarity score, e.g. calculated using Equation (2), at the top rank. Hence, we evaluate our approach using the Mean Reciprocal Rank (MRR) measure [Craswell2009], which is based on the reciprocal of the rank at which the first relevant concept is viewed in the ranking. In addition, we test the significance of differences between the performance achieved by our proposed approach and the baselines using the paired t-test (p < 0.05).
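The MRR measure cut off at rank 5 (MRR-5, as reported in the tables below) can be computed as follows; the concept ids in the example are hypothetical.

```python
def mrr_at_k(rankings, gold, k=5):
    """Mean Reciprocal Rank at cutoff k: for each query, score 1/rank of the
    first correct concept within the top k, or 0 if it is not in the top k."""
    total = 0.0
    for ranked_ids, correct_id in zip(rankings, gold):
        for position, concept_id in enumerate(ranked_ids[:k], start=1):
            if concept_id == correct_id:
                total += 1.0 / position
                break
    return total / len(gold)

# Two queries: the correct concept is ranked 1st and 2nd respectively.
score = mrr_at_k([["a", "b", "c"], ["b", "a", "c"]], ["a", "a"])
```

In this toy case the score is (1 + 1/2) / 2 = 0.75.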

4.3 Word Vector Representation

We use three different techniques, namely one-hot, CBOW and GloVe, to create the word vector representations used in our approach (see Section 3). In particular, the vocabulary for creating the one-hot representation includes all terms in the Twitter phrases and the descriptions of the target SNOMED-CT concepts. Meanwhile, we create word vector representations based on CBOW and GloVe by using the word2vec (https://code.google.com/p/word2vec/) and GloVe (http://nlp.stanford.edu/projects/glove/) implementations. We learn the vector representations from a collection of tweets and a collection of medical articles, respectively, using a window size of 10 words. The tweet collection (denoted Twitter) contains 419,702,147 English tweets, which are related to 11 drug names and 6 cities, while the medical article collection (denoted BMC) includes all medical articles from BioMed Central (http://www.biomedcentral.com/about/datamining). For both CBOW and GloVe, we create vector representations with vector sizes of 50 and 200.
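The one-hot scheme described here amounts to term-count vectors over the joint vocabulary of Twitter phrases and concept descriptions; a phrase vector is the element-wise sum of its terms' one-hot vectors. A minimal sketch, with made-up phrases:

```python
def build_vocabulary(texts):
    """Assign an index to every distinct term across all phrases/descriptions."""
    vocabulary = {}
    for text in texts:
        for term in text.lower().split():
            vocabulary.setdefault(term, len(vocabulary))
    return vocabulary

def one_hot_phrase(phrase, vocabulary):
    """Phrase vector as the element-wise sum of one-hot term vectors,
    i.e. a term-count vector over the vocabulary."""
    vector = [0] * len(vocabulary)
    for term in phrase.lower().split():
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector

# Illustrative vocabulary from one Twitter phrase and one concept description:
vocab = build_vocabulary(["no sleep tonight", "insomnia"])
```

Repeated terms accumulate, so `one_hot_phrase("no no sleep", vocab)` puts a 2 in the dimension for ‘no’.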

Approach      | One-hot | BMC CBOW-50 | BMC CBOW-200 | BMC GloVe-50 | BMC GloVe-200 | Twitter CBOW-50 | Twitter CBOW-200 | Twitter GloVe-50 | Twitter GloVe-200
vSim          | 0.1675  | 0.1771      | 0.1896       | 0.1840       | 0.1869        | 0.1812          | 0.1813           | 0.0936           | 0.1807
bestMT        | 0.2232  | 0.1926      | 0.2070       | 0.1803       | 0.2500        | 0.2014          | 0.2047           | 0.1258           | 0.2138
top5MT        | 0.2491  | 0.1994      | 0.2104       | 0.1879       | 0.2638        | 0.2037          | 0.2095           | 0.1322           | 0.2362
top5MTr       | 0.2458  | 0.1982      | 0.2109       | 0.1894       | 0.2617        | 0.2037          | 0.2096           | 0.1322           | 0.2310
bestMT+vSim   | 0.2420  | 0.1910      | 0.1953       | 0.1860       | 0.2532        | 0.1891          | 0.1954           | 0.1078           | 0.2374
top5MT+vSim   | 0.2556  | 0.1916      | 0.2144       | 0.1726       | 0.2600        | 0.1978          | 0.2068           | 0.1079           | 0.2405
top5MTr+vSim  | 0.2594  | 0.1861      | 0.2070       | 0.1802      | 0.2590        | 0.1959          | 0.2027           | 0.1129           | 0.2406
Table 1: MRR-5 performance of the proposed approach and the baselines. Significance (p < 0.05, paired t-test) is measured against the cosine similarity (vSim) baseline with the one-hot representation, and against vSim with the corresponding distributed word representation (e.g. CBOW or GloVe).

4.4 Learning Phrase-based Model

We use the phrase-based MT technique of Koehn et al. [Koehn et al.2003], as implemented in the Moses toolkit [Koehn et al.2007] with default settings, to learn to translate from the Twitter language to the medical language. In particular, when training the translator, we present the learner with pairs of Twitter phrases and the descriptions of their corresponding SNOMED-CT concepts.

5 Experimental Results

We evaluate 6 different instantiations of the proposed approach discussed in Section 3:

  1. bestMT: set k = 1 when finding the translated phrase for a Twitter phrase (Equation (1)), before ranking target medical concepts for the translated phrase using Equation (2).

  2. top5MT: similar to bestMT, but set k = 5.

  3. top5MTr: similar to top5MT, but also consider the rank position of the translated phrases when ranking the target medical concepts, using Equation (3).

  4. bestMT+vSim: augment the ranking generated by bestMT with the cosine similarity between the vector representations of the Twitter phrase and the descriptions of the target medical concepts, using Equation (4).

  5. top5MT+vSim: similar to bestMT+vSim, but use the ranking from top5MT.

  6. top5MTr+vSim: similar to bestMT+vSim, but use the ranking from top5MTr.

In addition, we use the vSim baseline, where we consider only the cosine similarity between the vector representations of the Twitter phrase and the description of each target medical concept.

Table 1 compares the performance of these 6 instantiations and the vSim baseline in terms of MRR-5. We first observe that, for the vSim baseline, except for the word vector representation with vector size 50 learned using GloVe from the Twitter collection, word vector representations learned using either CBOW or GloVe are more effective than the one-hot representation. However, the differences in MRR-5 performance are not statistically significant (paired t-test). In addition, word vector representations learned using either CBOW or GloVe with vector size 200 are more effective than those with vector size 50.

Next, we find that our adaptation of phrase-based MT (i.e. bestMT, top5MT and top5MTr) significantly (p < 0.05) outperforms the vSim baseline. For example, with the one-hot representation, top5MT (MRR-5 0.2491) and top5MTr (MRR-5 0.2458) perform significantly (p < 0.05) better than vSim (MRR-5 0.1675), by up to 49%. Meanwhile, when using word vector representations with vector size 200 learned using GloVe from the BMC collection, top5MT (MRR-5 0.2638) significantly (p < 0.05) outperforms vSim with both the GloVe vector representation (MRR-5 0.1869) and the one-hot representation (MRR-5 0.1675). We observe similar trends in performance when using vector representations learned from the Twitter collection. These results show that our adapted phrase-based MT techniques are effective for the medical term normalisation task.

In addition, we observe the effectiveness of our combined approach (i.e. bestMT+vSim, top5MT+vSim and top5MTr+vSim), as it further improves on the performance of the adapted phrase-based MT (i.e. bestMT, top5MT and top5MTr, respectively) when using the one-hot representation. For example, top5MTr+vSim achieves an MRR-5 of 0.2594, while the MRR-5 of top5MTr is 0.2458. However, the performance differences are not statistically significant. Meanwhile, when using the CBOW and GloVe vectors, the achieved performance varies with the collection (i.e. BMC or Twitter) used for learning the vectors and with the vector size.

6 Conclusions

We have introduced an approach that adapts a phrase-based MT technique to normalise medical terms in Twitter messages. We evaluate our proposed approach using a collection of phrases from tweets related to ADRs. Our experimental results show that the proposed approach significantly outperforms an effective baseline by up to 55%. For future work, we aim to investigate the modelling of learned vector representations, such as CBOW and GloVe, within a phrase-based MT model when normalising medical terms.

Acknowledgements

The authors gratefully acknowledge Nestor Alvaro (Sokendai, Japan) for providing access to the Twitter/SNOMED-CT annotations which were used to derive the test collection used in these experiments. The derived dictionary and a representative sample of the word vector representations (CBOW and GloVe at 200d) are made available on Zenodo.org (DOI: http://dx.doi.org/10.5281/zenodo.27354). We wish to thank funding support from the EPSRC (grant number EP/M005089/1).

References

  • [Baldwin et al.2013] Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How noisy social media text, how diffrnt social media sources. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 356–364.
  • [Craswell2009] Nick Craswell. 2009. Mean reciprocal rank. In Encyclopedia of Database Systems, pages 1703–1703. Springer.
  • [Elkin et al.2012] Peter L Elkin, David A Froehling, Dietlind L Wahner-Roedler, Steven H Brown, and Kent R Bailey. 2012. Comparison of natural language processing biosurveillance methods for identifying influenza from encounter notes. Annals of Internal Medicine, 156(1_Part_1):11–18.
  • [Gobbel et al.2014] Glenn T Gobbel, Ruth Reeves, Shrimalini Jayaramaraja, Dario Giuse, Theodore Speroff, Steven H Brown, Peter L Elkin, and Michael E Matheny. 2014. Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives. Journal of Biomedical Informatics, 48:54–65.
  • [Hanson et al.2013] Carl L Hanson, Scott H Burton, Christophe Giraud-Carrier, Josh H West, Michael D Barnes, and Bret Hansen. 2013. Tweaking and tweeting: exploring twitter for nonmedical use of a psychostimulant drug (adderall) among college students. Journal of medical Internet research, 15(4).
  • [Koehn et al.2003] Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 48–54. Association for Computational Linguistics.
  • [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.
  • [Limsopatham and Collier2015] Nut Limsopatham and Nigel Collier. 2015. Towards the Semantic Interpretation of Personal Health Messages from Social Media. In Proceedings of the 1st International Workshop on Understanding the City with Urban Informatics, UCUI@CIKM 2015, Association for Computing Machinery.
  • [Mikolov et al.2013a] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • [Och and Ney2004] Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational linguistics, 30(4):417–449.
  • [O’Connor et al.2014] Karen O’Connor, Pranoti Pimpalkhute, Azadeh Nikfarjam, Rachel Ginn, Karen L Smith, and Graciela Gonzalez. 2014. Pharmacovigilance on twitter? mining tweets for adverse drug reactions. In AMIA Annual Symposium Proceedings, volume 2014, page 924. American Medical Informatics Association.
  • [Passos et al.2014] Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. arXiv preprint arXiv:1404.5367.
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543.
  • [Schneeweiss et al.2010] Sebastian Schneeweiss, Amanda R Patrick, Daniel H Solomon, Colin R Dormuth, Matt Miller, Jyotsna Mehta, Jennifer C Lee, and Philip S Wang. 2010. Comparative safety of antidepressant agents for children and adolescents regarding suicidal acts. Pediatrics, pages peds–2009.
  • [Spackman et al.1997] Kent A Spackman, Keith E Campbell, and Roger A Côté. 1997. Snomed rt: a reference terminology for health care. In Proceedings of the AMIA annual fall symposium, page 640. American Medical Informatics Association.
  • [Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.
  • [Wang et al.2009] Xiaoyan Wang, George Hripcsak, Marianthi Markatou, and Carol Friedman. 2009. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. Journal of the American Medical Informatics Association, 16(3):328–337.

Appendix

Tables 2 and 3 report the MRR-5 performance when using the word vector representations learned from the BMC and Twitter collections, respectively, with vector sizes 50, 100 and 200, using CBOW and GloVe.

Approach      | One-hot | CBOW-50 | CBOW-100 | CBOW-200 | GloVe-50 | GloVe-100 | GloVe-200
vSim          | 0.1675  | 0.1771  | 0.1882   | 0.1896   | 0.1840   | 0.1593    | 0.1869
bestMT        | 0.2232  | 0.1926  | 0.1956   | 0.2070   | 0.1803   | 0.2338    | 0.2500
top5MT        | 0.2491  | 0.1994  | 0.1971   | 0.2104   | 0.1879   | 0.2425    | 0.2638
top5MTr       | 0.2458  | 0.1982  | 0.1971   | 0.2109   | 0.1894   | 0.2391    | 0.2617
bestMT+vSim   | 0.2420  | 0.1910  | 0.1893   | 0.1953   | 0.1860   | 0.2375    | 0.2532
top5MT+vSim   | 0.2556  | 0.1916  | 0.2025   | 0.2144   | 0.1726   | 0.2381    | 0.2600
top5MTr+vSim  | 0.2594  | 0.1861  | 0.1918   | 0.2070   | 0.1802   | 0.2451    | 0.2590
Table 2: MRR-5 performance of the proposed approach when the word vector representations created by CBOW and GloVe are learned from the BMC collection with vector sizes 50, 100 and 200. Significance (p < 0.05, paired t-test) is measured against the cosine similarity baseline with the one-hot representation, and against the cosine similarity with the corresponding distributed word representation.
Approach      | One-hot | CBOW-50 | CBOW-100 | CBOW-200 | GloVe-50 | GloVe-100 | GloVe-200
vSim          | 0.1675  | 0.1812  | 0.1901   | 0.1813   | 0.0936   | 0.1836    | 0.1807
bestMT        | 0.2232  | 0.2014  | 0.1993   | 0.2047   | 0.1258   | 0.2114    | 0.2138
top5MT        | 0.2491  | 0.2037  | 0.2060   | 0.2095   | 0.1322   | 0.2320    | 0.2362
top5MTr       | 0.2458  | 0.2037  | 0.2037   | 0.2096   | 0.1322   | 0.2279    | 0.2310
bestMT+vSim   | 0.2420  | 0.1891  | 0.1959   | 0.1954   | 0.1078   | 0.2161    | 0.2374
top5MT+vSim   | 0.2556  | 0.1978  | 0.2033   | 0.2068   | 0.1079   | 0.2420    | 0.2405
top5MTr+vSim  | 0.2594  | 0.1959  | 0.1913   | 0.2027   | 0.1129   | 0.2352    | 0.2406
Table 3: MRR-5 performance of the proposed approach when the word vector representations created by CBOW and GloVe are learned from the Twitter collection with vector sizes 50, 100 and 200. Significance (p < 0.05, paired t-test) is measured against the cosine similarity baseline with the one-hot representation, and against the cosine similarity with the corresponding distributed word representation.
Table 3: MRR-5 performance of the proposed approach when the word vector representation created by CBOW and GloVe is learned from the Twitter collection with window sizes 50, 100 and 200. Significant differences () with the cosine similarity with the one-hot representation, and the cosine similarity with the corresponding distributed word representation vector are denoted and , respectively.