Probabilistic Modelling of Morphologically Rich Languages

08/18/2015 ∙ by Jan A. Botha, et al.

This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex languages well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome the data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a language are, based only on an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.
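To illustrate the compositional idea described above — linking morphologically related words through shared sub-word representations — here is a minimal sketch in which a word's vector is the sum of the vectors of its morphemes. This is an illustrative toy, not the thesis' actual model: the segmentations, morpheme inventory, and dimensionality are invented for the example, and the vectors are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Hypothetical morpheme segmentations for a few word forms.
segmentations = {
    "unbreakable": ["un-", "break", "-able"],
    "breakable":   ["break", "-able"],
    "unbreaking":  ["un-", "break", "-ing"],
}

# One vector per morpheme, shared across every word that contains it.
morphemes = {m for seg in segmentations.values() for m in seg}
morpheme_vecs = {m: rng.normal(size=dim) for m in sorted(morphemes)}

def word_vector(word):
    """Compose a word's representation additively from its morpheme vectors."""
    return sum(morpheme_vecs[m] for m in segmentations[word])

# Because morpheme vectors are shared, distinct word forms that share
# sub-word structure end up with systematically related representations,
# so statistics learned for frequent forms can transfer to rarer ones.
delta = word_vector("unbreakable") - word_vector("breakable")
print(np.allclose(delta, morpheme_vecs["un-"]))
```

The point of the sketch is the parameter sharing: the rare form "unbreaking" reuses vectors already trained on "unbreakable" and "breakable", which is one way sub-word structure can mitigate the sparsity that morphological productivity creates.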


References

  • Adel et al. (2013) H. Adel, N. T. Vu, and T. Schultz. 2013.

    Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling.

    In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 206–211, Sofia, Bulgaria, August. Association for Computational Linguistics.
  • Albert et al. (2013) A. Albert, B. MacWhinney, B. Nir, and S. Wintner. 2013. The Hebrew CHILDES corpus: transcription and morphological analysis. Language Resources and Evaluation, 47(4):973–1005.
  • Alexandrescu and Kirchhoff (2006) A. Alexandrescu and K. Kirchhoff. 2006. Factored Neural Language Models. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, New York City, USA, June. Association for Computational Linguistics.
  • Allauzen et al. (2013) A. Allauzen, N. Pechéux, Q. K. Do, M. Dinarelli, T. Lavergne, A. Max, H.-S. Le, and F. Yvon. 2013. LIMSI@ WMT13. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 62–69, Sofia, Bulgaria, August. Association for Computational Linguistics.
  • Baroni and Matiasek (2002) M. Baroni and J. Matiasek. 2002. Predicting the components of German nominal compounds. In

    Proceedings of the 15th European Conference on Artificial Intelligence (ECAI)

    , pages 470–474, Lyon, France, July.
  • Baroni and Zamparelli (2010) M. Baroni and R. Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193, Cambridge, MA.
  • Baroni et al. (2002) M. Baroni, J. Matiasek, and H. Trost. 2002. Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 48–57. Association for Computational Linguistics, July.
  • Beal (2003) M. J. Beal. 2003.

    Variational Algorithms for Approximate Bayesian Inference

    .
    Ph.D. thesis, University of London.
  • Beesley and Karttunen (2003) K. R. Beesley and L. Karttunen. 2003. Finite state morphology, volume 18. CSLI Publications Stanford.
  • Bengio et al. (2003) Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.
  • Berton et al. (1996) A. Berton, P. Fetter, and P. Regel-Brietzmann. 1996. Compound Words in Large-Vocabulary German Speech Recognition Systems. In Proceedings of Fourth International Conference on Spoken Language Processing. ICSLP ’96, volume 2, pages 1165–1168. IEEE.
  • Bilmes and Kirchhoff (2003) J. A. Bilmes and K. Kirchhoff. 2003. Factored Language Models and Generalized Parallel Backoff. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics: short papers. Association for Computational Linguistics.
  • Blunsom and Cohn (2011) P. Blunsom and T. Cohn. 2011. A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 865–874, Portland, Oregon, USA, June. Association for Computational Linguistics.
  • Blunsom et al. (2009) P. Blunsom, T. Cohn, S. Goldwater, and M. Johnson. 2009. A Note on the Implementation of Hierarchical Dirichlet Processes. In Proceedings of the ACL-IJCNLP 2009 Conference: Short Papers, pages 337–340. Association for Computational Linguistics, August.
  • Botha (2012) J. A. Botha. 2012. Hierarchical Bayesian Language Modelling for the Linguistically Informed. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 64–73, Avignon, France. Association for Computational Linguistics.
  • Botha and Blunsom (2013) J. A. Botha and P. Blunsom. 2013. Adaptor Grammars for Learning Non-Concatenative Morphology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, October.
  • Botha and Blunsom (2014) J. A. Botha and P. Blunsom. 2014. Compositional Morphology for Word Representations and Language Modelling. In Proceedings of the 31st International Conference on Machine Learning, ICML, pages 1899–1907, Beijing, China, June.
  • Botha et al. (2012) J. A. Botha, C. Dyer, and P. Blunsom. 2012. Bayesian Language Modelling of German Compounds. In Proceedings of the 24th International Conference on Computational Linguistics (COLING), Mumbai, India.
  • Boudlal et al. (2009) A. Boudlal, R. Belahbib, A. Lakhouaja, A. Mazroui, A. Meziane, and M. Bebah. 2009. A Markovian approach for Arabic Root Extraction. The International Arab Journal of Information Technology, 8(1):91–98.
  • Boullier (2000) P. Boullier. 2000. A cubic time extension of context-free grammars. Grammars, 3(2-3):111–131.
  • Brants et al. (2007) T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), volume 1, pages 858–867. Association for Computational Linguistics, June.
  • Brown et al. (1992) P. F. Brown, P. V. DeSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. 1992. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4):467–479.
  • Buckwalter (2002) T. Buckwalter. 2002. Buckwalter Arabic Morphological Analyzer Version 1.0. Technical report, Linguistic Data Consortium.
  • Bullinaria and Levy (2007) J. A. Bullinaria and J. P. Levy. 2007. Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study. Behavior research methods, 39(3):510–26, August.
  • Çöltekin (2010) Ç. Çöltekin. 2010. A Freely Available Morphological Analyzer for Turkish. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC2010), pages 820–827.
  • Chahuneau et al. (2013a) V. Chahuneau, E. Schlinger, N. A. Smith, and C. Dyer. 2013a. Translating into Morphologically Rich Languages with Synthetic Phrases. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1677–1687, Seattle, Washington, USA. Association for Computational Linguistics.
  • Chahuneau et al. (2013b) V. Chahuneau, N. A. Smith, and C. Dyer. 2013b. Knowledge-rich morphological priors for Bayesian language models. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1206–1215, Atlanta, Georgia. Association for Computational Linguistics.
  • Chelba et al. (2014) C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. 2014. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 2635–2639, Singapore.
  • Chen and Goodman (1998) S. F. Chen and J. Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical report, Center for Research in Computing Technology, Harvard University, Cambridge, MA.
  • Chen and Goodman (1999) S. F. Chen and J. Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13:359–394, October.
  • Chiang (2007) D. Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228.
  • Clark (2001a) A. Clark. 2001a. Unsupervised induction of stochastic context-free grammars using distributional clustering. In Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL), pages 105–112, Toulouse, France. Association for Computational Linguistics.
  • Clark (2001b) A. Clark. 2001b.

    Partially Supervised Learning of Morphology with Stochastic Transducers.

    In Proceedings of Natural Language Processing Pacific Rim Symposium, pages 341–348, Tokyo, Japan, November.
  • Clark (2002) A. Clark. 2002. Memory-Based Learning of Morphology with Stochastic Transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 513–520.
  • Clark et al. (2011) J. H. Clark, C. Dyer, A. Lavie, and N. A. Smith. 2011. Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176–181, Portland, Oregon, USA.
  • Cohen et al. (2010) S. B. Cohen, D. M. Blei, and N. A. Smith. 2010. Variational Inference for Adaptor Grammars. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 564–572, Los Angeles, California. Association for Computational Linguistics.
  • Cohen-Sygal and Wintner (2006) Y. Cohen-Sygal and S. Wintner. 2006. Finite-state registered automata for non-concatenative morphology. Computational Linguistics, 32(1):49–82.
  • Cohn et al. (2010) T. Cohn, P. Blunsom, and S. Goldwater. 2010. Inducing Tree-Substitution Grammars. Journal of Machine Learning Research, 11:3053–3096.
  • Collobert and Weston (2008) R. Collobert and J. Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland.
  • Creutz and Lagus (2007) M. Creutz and K. Lagus. 2007. Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing, 4(1):1–34.
  • Darwish (2002) K. Darwish. 2002. Building a shallow Arabic morphological analyzer in one day. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, pages 47–54. Association for Computational Linguistics.
  • Daya et al. (2008) E. Daya, D. Roth, and S. Wintner. 2008. Identifying Semitic Roots: Machine Learning with Linguistic Constraints. Computational Linguistics, 34(3):429–448.
  • de Almeida and Libben (2005) R. G. de Almeida and G. Libben. 2005. Changing morphological structures: The effect of sentence context on the interpretation of structurally ambiguous English trimorphemic words. Language and Cognitive Processes, 20(1-2):373–394.
  • de Marcken (1996) C. G. de Marcken. 1996. Unsupervised Language Acquisition. Ph.D. thesis, Massachusetts Institute of Technology.
  • de Roeck and Al-Fares (2000) A. N. de Roeck and W. Al-Fares. 2000. A morphologically sensitive clustering algorithm for identifying Arabic roots. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 199–206. Association for Computational Linguistics.
  • Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159.
  • Duh et al. (2013) K. Duh, G. Neubig, K. Sudoh, and H. Tsukada. 2013. Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 678–683, Sofia, Bulgaria.
  • Dukes and Habash (2010) K. Dukes and N. Habash. 2010. Morphological Annotation of Quranic Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC).
  • Dyer (2009) C. Dyer. 2009. Using a maximum entropy model to build segmentation lattices for MT. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 406–414, Boulder, Colorado. Association for Computational Linguistics.
  • Dyer et al. (2010) C. Dyer, A. Lopez, J. Ganitkevitch, J. Weese, F. Ture, P. Blunsom, H. Setiawan, V. Eidelman, and P. Resnik. 2010. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. In Proceedings of the ACL 2010 System Demonstrations, pages 7–12, Uppsala, Sweden. Association for Computational Linguistics.
  • Dyer et al. (2013) C. Dyer, V. Chahuneau, and N. A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, USA. Association for Computational Linguistics.
  • Egan (1975) J. P. Egan. 1975. Signal detection theory and ROC analysis. Series in Cognition and Perception. Academic Press, New York, NY.
  • Fawcett (2006) T. Fawcett. 2006. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June.
  • Ferguson (1973) T. S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. Annals of Statistics, (1):209–230.
  • Finkel and Stump (2002) R. Finkel and G. Stump. 2002. Generating Hebrew verb morphology by default inheritance hierarchies. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages. Association for Computational Linguistics.
  • Finkelstein et al. (2002) L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. 2002. Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1):116–131.
  • Forsberg and Ranta (2004) M. Forsberg and A. Ranta. 2004. Functional Morphology. In Proceedings of the Ninth ACM SIGPLAN International Conference on Functional Programming, pages 213–223, Utah, September. Association for Computing Machinery.
  • Frank et al. (2013) S. Frank, F. Keller, and S. Goldwater. 2013. Exploring the utility of joint morphological and syntactic learning from child-directed speech. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA. Association for Computational Linguistics.
  • Fullwood and O’Donnell (2013) M. A. Fullwood and T. J. O’Donnell. 2013. Learning non-concatenative morphology. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics, pages 21–27, Sofia, Bulgaria. Association for Computational Linguistics.
  • Gasser (2009) M. Gasser. 2009. Semitic morphological analysis and generation using finite state transducers with feature structures. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 309–317. Association for Computational Linguistics.
  • Geutner (1995) P. Geutner. 1995. Using Morphology Towards Better Large-Vocabulary Speech Recognition Systems. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, pages 445–448. IEEE.
  • Gildea (2010) D. Gildea. 2010. Optimal Parsing Strategies for Linear Context-Free Rewriting Systems. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 769–776. Association for Computational Linguistics.
  • Gilks et al. (1996) W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. 1996. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC Interdisciplinary Statistics.
  • Goldsmith and Reutter (1998) J. Goldsmith and T. Reutter. 1998. Automatic Collection and Analysis of German Compounds. In F. Busa F. et al., editor, The Computational Treatment of Nominals, pages 61–69. Universite de Montreal, Canada.
  • Goldwater (2007) S. Goldwater. 2007. Nonparametric Bayesian Models of Lexical Acquisition. Ph.D. thesis, Brown University.
  • Goldwater et al. (2006) S. Goldwater, T. L. Griffiths, and M. Johnson. 2006. Interpolating Between Types and Tokens by Estimating Power-Law Generators. In Advances in Neural Information Processing Systems, pages 459–466.
  • Goldwater et al. (2011) S. Goldwater, T. L. Griffiths, and M. Johnson. 2011. Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models. Journal of Machine Learning Research, 12:2335–2382.
  • Good (1953) I. J. Good. 1953. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika, 40(3):237–264.
  • Goodman (2001) J. Goodman. 2001. Classes for Fast Maximum Entropy Training. In Proceedings of the 2001 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 561–564. IEEE.
  • Goodman (1998) J. T. Goodman. 1998. Parsing Inside-Out. Ph.D. thesis, Harvard University.
  • Gurevych (2005) I. Gurevych. 2005. Using the Structure of a Conceptual Network in Computing Semantic Relatedness. In Proceedings of the Second International Joint Conference on Natural Language Processing: Full Papers, pages 767–778.
  • Gutmann and Hyvärinen (2012) M. U. Gutmann and A. Hyvärinen. 2012. Noise-Contrastive Estimation of Unnormalized Statistical Models , with Applications to Natural Image Statistics. The Journal of Machine Learning Research, 13:307–361.
  • Habash and Rambow (2005) N. Habash and O. Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 573–580, Ann Arbor, June. Association for Computational Linguistics.
  • Habash and Rambow (2006) N. Habash and O. Rambow. 2006. MAGEAD: a morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, volume M, pages 681–688, Sydney, Australia, July. Association for Computational Linguistics.
  • Habash and Sadat (2006) N. Habash and F. Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 49–52. Association for Computational Linguistics.
  • Habash et al. (2005) N. Habash, O. Rambow, and G. Kiraz. 2005. Morphological Analysis and Generation for Arabic Dialects. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 17–24. Association for Computational Linguistics, June.
  • Habash (2010) N. Y. Habash. 2010. Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies, 3(1):1–187.
  • Hammarström and Borin (2011) H. Hammarström and L. Borin. 2011. Unsupervised Learning of Morphology. Computational Linguistics, 37(2):309–350.
  • Hassan and Mihalcea (2009) S. Hassan and R. Mihalcea. 2009. Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1192–1201. Association for Computational Linguistics.
  • Heafield (2011) K. Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
  • Heafield et al. (2013) K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690–696, Sofia, Bulgaria, August. Association for Computational Linguistics.
  • Hermann and Blunsom (2013) K. M. Hermann and P. Blunsom. 2013. The Role of Syntax in Vector Space Models of Compositional Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 894–904, Sofia, Bulgaria.
  • Hirsimäki et al. (2006) T. Hirsimäki, M. Creutz, V. Siivola, M. Kurimo, S. Virpioja, and J. Pylkkönen. 2006. Unlimited vocabulary speech recognition with morph language models applied to Finnish. Computer Speech and Language, 20(4):515–541, October.
  • Huang et al. (2012) E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–882. Association for Computational Linguistics.
  • Huang and Renals (2009) S. Huang and S. Renals. 2009. A parallel training algorithm for hierarchical Pitman-Yor process language models. In Proceedings of Interspeech, volume 9, pages 2695–2698.
  • Huang and Renals (2010) S. Huang and S. Renals. 2010. Hierarchical Bayesian Language Models for Conversational Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):1941–1954, November.
  • Huang et al. (2011) Y. Huang, M. Zhang, and C. L. Tan. 2011. Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 534–539. Association for Computational Linguistics.
  • Ircing et al. (2001) P. Ircing, P. Krbec, J. Hajic, S. Khudanpur, F. Jelinek, J. Psutka, and W. Byrne. 2001. On large vocabulary continuous speech recognition of highly inflectional language – Czech. In Proceedings of the 7th European Conference on Speech Communication and Technology (EUROSPEECH), volume 1.
  • Ishwaran and James (2001) H. Ishwaran and L. F. James. 2001. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96(453):161–173.
  • Ishwaran and James (2003) H. Ishwaran and L. F. James. 2003. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211–1235.
  • Jelinek and Mercer (1980) F. Jelinek and R. L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, Amsterdam, The Netherlands.
  • Johnson (2008) M. Johnson. 2008. Unsupervised word segmentation for Sesotho using Adaptor Grammars. In Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology, pages 20–27. Association for Computational Linguistics.
  • Johnson and Goldwater (2009) M. Johnson and S. Goldwater. 2009. Improving nonparameteric Bayesian inference: Experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 317–325. Association for Computational Linguistics.
  • Johnson et al. (2007a) M. Johnson, T. L. Griffiths, and S. Goldwater. 2007a. Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models. In Advances in Neural Information Processing Systems, volume 19, page 641. MIT.
  • Johnson et al. (2007b) M. Johnson, T. L. Griffiths, and S. Goldwater. 2007b. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, number April, pages 139–146, Rochester, NY. Association for Computational Linguistics.
  • Joshi (1985) A. K. Joshi. 1985. Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? In D. R. Dowty, L. Karttunen, and A. M. Zwicky, editors, Natural Language Parsing, chapter 6, pages 206–250. Cambridge University Press.
  • Joubarne and Inkpen (2011) C. Joubarne and D. Inkpen. 2011. Comparison of Semantic Similarity for Different Languages Using the Google N-gram Corpus and Second-Order Co-occurrence Measures. In Proceedings of the Canadian Conference on Advances in Artificial Intelligence, pages 216–221. Springer-Verlag.
  • Kalchbrenner and Blunsom (2013) N. Kalchbrenner and P. Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.
  • Kallmeyer (2010) L. Kallmeyer. 2010. Parsing beyond context-free grammars. Springer Verlag.
  • Kataja and Koskenniemi (1988) L. Kataja and K. Koskenniemi. 1988. Finite-state Description of Semitic Morphology: A Case Study of Ancient Akkadian. In Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics, pages 313–315.
  • Kato et al. (2006) Y. Kato, H. Seki, and T. Kasami. 2006. Stochastic Multiple Context-Free Grammar for RNA Pseudoknot Modeling. In Proceedings of the International Workshop on Tree Adjoining Grammar and Related Formalisms, pages 57–64, July.
  • Katz (1987) S. Katz. 1987.

    Estimation of probabilities from sparse data for the language model component of a speech recognizer.

    IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, March.
  • Kay (1987) M. Kay. 1987. Nonconcatenative Finite-State Morphology. In Proceedings of the third conference of the European Chapter of the Association for Computational Linguistics.
  • Kiraz (2000) G. A. Kiraz. 2000. Multitiered Nonlinear Morphology Using Multitape Finite Automata: A Case Study on Syriac and Arabic. Computational Linguistics, 26(1):77–105, March.
  • Kirchhoff et al. (2008) K. Kirchhoff, J. Bilmes, and K. Duh. 2008. Factored Language Models Tutorial. Technical Report UWEETR-2008-0004, Department of Electrical Engineering, University ofWashington, Seattle.
  • Kneser and Ney (1995) R. Kneser and H. Ney. 1995. Improved Backing-off for m-gram Language Modelling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 181–184.
  • Koehn (2010) P. Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
  • Koehn and Knight (2003) P. Koehn and K. Knight. 2003. Empirical Methods for Compound Splitting. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 187–193. Association for Computational Linguistics.
  • Koehn et al. (2008) P. Koehn, A. Arun, and H. Hoang. 2008. Towards better Machine Translation Quality for the German-English Language Pairs. In Third Workshop on Statistical Machine Translation, pages 139–142. Association for Computational Linguistics, June.
  • Koskenniemi (1984) K. Koskenniemi. 1984. A general computational model for word-form recognition and production. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics, pages 178–181. Association for Computational Linguistics.
  • Kurimo et al. (2010) M. Kurimo, S. Virpioja, V. T. Turunen, G. W. Blackwood, and W. Byrne. 2010. Overview and Results of Morpho Challenge 2009. In C. Peters, G. M. Nunzio, M. Kurimo, T. Mandl, D. Mostefa, A. Peñas, and G. Roda, editors, Multilingual Information Access Evaluation I. Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science, pages 578–597. Springer Berlin / Heidelberg.
  • Lari and Young (1990) K. Lari and S. J. Young. 1990. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech & Language, 4(1):35–56.
  • Lavie et al. (1988) A. Lavie, A. Itai, U. Ornan, and M. Rimon. 1988. On the Applicabtlity of Two Level Morphology to the Inflection of Hebrew Verbs. Technical Report CS0513, Computer Science Department, Israeli Institute of Technology, Technion.
  • Lazaridou et al. (2013) A. Lazaridou, M. Marelli, R. Zamparelli, and M. Baroni. 2013. Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1517–1526, Sofia, Bulgaria. Association for Computational Linguistics.
  • Le et al. (2011) H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, and F. Yvon. 2011. Structured Output Layer Neural Network Language Model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5524–5527, Prague, Czech Republic. IEEE.
  • Lee et al. (2011) Y. K. Lee, A. Haghighi, and R. Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics.
  • Levenberg et al. (2012) A. Levenberg, C. Dyer, and P. Blunsom. 2012. A Bayesian Model for Learning SCFGs with Discontiguous Rules. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 223–232, July.
  • Liang et al. (2006) P. Liang, B. Taskar, and D. Klein. 2006. Alignment by Agreement. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 104–111, New York City, USA, June. Association for Computational Linguistics.
  • Luong et al. (2013) M.-T. Luong, R. Socher, and C. D. Manning. 2013. Better Word Representations with Recursive Neural Networks for Morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Association for Computational Linguistics.
  • Maier (2010) W. Maier. 2010. Direct Parsing of Discontinuous Constituents in German. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 58–66. Association for Computational Linguistics.
  • McCarthy (1981) J. J. McCarthy. 1981. A Prosodic Theory of Nonconcatenative Morphology. Linguistic Inquiry, 12(3):373–418.
  • Mikolov et al. (2010) T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech, pages 1045–1048, September.
  • Mikolov et al. (2011) T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur. 2011. Extensions of Recurrent Neural Network Language Model. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, Czech Republic.
  • Mikolov et al. (2013a) T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations. arXiv:1301.3781.
  • Mikolov et al. (2013b) T. Mikolov, W.-t. Yih, and G. Zweig. 2013b. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751. Association for Computational Linguistics.
  • Minkov et al. (2007) E. Minkov, K. Toutanova, and H. Suzuki. 2007. Generating Complex Morphology for Machine Translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 128–135, June.
  • Mitchell and Lapata (2008) J. Mitchell and M. Lapata. 2008. Vector-based Models of Semantic Composition. In Proceedings of ACL-08: HLT, pages 236–244, Columbus, Ohio, June. Association for Computational Linguistics.
  • Mitchell and Lapata (2010) J. Mitchell and M. Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429, November.
  • Mnih and Hinton (2007) A. Mnih and G. Hinton. 2007. Three New Graphical Models for Statistical Language Modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648, Corvallis, Oregon.
  • Mnih and Hinton (2008) A. Mnih and G. Hinton. 2008. A Scalable Hierarchical Distributed Language Model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, pages 1081–1088.
  • Mnih and Teh (2012) A. Mnih and Y. W. Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland.
  • Mnih et al. (2009) A. Mnih, Z. Yuecheng, and G. Hinton. 2009. Improving a statistical language model through non-linear prediction. Neurocomputing, 72(7-9):1414–1418.
  • Mochihashi et al. (2009) D. Mochihashi, T. Yamada, and N. Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 100–108, Suntec, Singapore. Association for Computational Linguistics.
  • Morin and Bengio (2005) F. Morin and Y. Bengio. 2005. Hierarchical probabilistic neural network language model. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics.
  • Mousa et al. (2011) A. E.-D. Mousa, M. A. B. Shaik, R. Schlüter, and H. Ney. 2011. Morpheme Based Factored Language Models for German LVCSR. In Interspeech, pages 1445–1448.
  • Nair and Hinton (2010) V. Nair and G. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, Haifa, Israel.
  • Neal (1993) R. Neal. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
  • Neal (2003) R. M. Neal. 2003. Slice Sampling. The Annals of Statistics, 31(3):705–741.
  • Nederhof (2000) M.-J. Nederhof. 2000. Practical Experiments with Regular Approximation of Context-free Languages. Computational Linguistics, 26(1).
  • Neubig et al. (2010) G. Neubig, M. Mimura, S. Mori, and T. Kawahara. 2010. Learning a Language Model from Continuous Speech. In Interspeech, pages 1053–1056, Chiba, Japan, September.
  • Ney et al. (1994) H. Ney, U. Essen, and R. Kneser. 1994. On structuring probabilistic dependences in stochastic language modelling. Computer Speech & Language, 8(1):1–38.
  • Och (2003) F. J. Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167.
  • Pachitariu and Sahani (2013) M. Pachitariu and M. Sahani. 2013. Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650.
  • Papineni et al. (2001) K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2001. BLEU: A Method for Automatic Evaluation of Machine Translation. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, NY.
  • Pascanu et al. (2012) R. Pascanu, T. Mikolov, and Y. Bengio. 2012. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning.
  • Pitman (1995) J. Pitman. 1995. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102:145–158.
  • Pitman and Yor (1997) J. Pitman and M. Yor. 1997. The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator. The Annals of Probability, 25(2):855–900.
  • Poon et al. (2009) H. Poon, C. Cherry, and K. Toutanova. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 209–217. Association for Computational Linguistics.
  • Rodrigues and Ćavar (2007) P. Rodrigues and D. Ćavar. 2007. Learning Arabic Morphology Using Statistical Constraint-Satisfaction Models. In E. Benmamoun, editor, Perspectives on Arabic Linguistics: Proceedings of the 19th Arabic Linguistics Symposium, pages 63–75, Urbana, IL, USA. John Benjamins Publishing Company.
  • Rodríguez and Satta (2009) C. G. Rodríguez and G. Satta. 2009. An Optimal-Time Binarization Algorithm for Linear Context-Free Rewriting Systems with Fan-Out Two. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 985–993. Association for Computational Linguistics.
  • Rosenfeld et al. (2001) R. Rosenfeld, S. F. Chen, and X. Zhu. 2001. Whole-sentence exponential language models: a vehicle for linguistic-statistical integration. Computer Speech & Language, 15(1).
  • Rubenstein and Goodenough (1965) H. Rubenstein and J. B. Goodenough. 1965. Contextual Correlates of Synonymy. Communications of the Association for Computing Machinery, 8(10):627–633, October.
  • Schmid and Laws (2008) H. Schmid and F. Laws. 2008. Estimation of Conditional Probabilities With Decision Trees and an Application to Fine-Grained POS Tagging. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 777–784, Manchester, UK. Association for Computational Linguistics.
  • Schone and Jurafsky (2000) P. Schone and D. Jurafsky. 2000. Knowledge-Free Induction of Morphology Using Latent Semantic Analysis. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop, pages 67–72, Lisbon, Portugal.
  • Schwenk (2004) H. Schwenk. 2004. Efficient Training of Large Neural Networks for Language Modeling. In Proceedings of the IEEE Joint Conference on Neural Networks, volume 4, pages 3059–3064. IEEE.
  • Schwenk and Koehn (2008) H. Schwenk and P. Koehn. 2008. Large and Diverse Language Models for Statistical Machine Translation. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
  • Schwenk et al. (2006) H. Schwenk, D. Dechelotte, and J.-L. Gauvain. 2006. Continuous Space Language Models for Statistical Machine Translation. In Proceedings of COLING/ACL, pages 723–730, Sydney, Australia. Association for Computational Linguistics.
  • Schwenk et al. (2012) H. Schwenk, A. Rousseau, and M. Attik. 2012. Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 11–19. Association for Computational Linguistics.
  • Seki et al. (1991) H. Seki, T. Matsumura, M. Fujii, and T. Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229.
  • Sethuraman (1994) J. Sethuraman. 1994. A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4:639–650.
  • Sirts and Goldwater (2013) K. Sirts and S. Goldwater. 2013. Minimally-Supervised Morphological Segmentation using Adaptor Grammars. Transactions of the Association of Computational Linguistics, 1:255–266.
  • Smrž (2007) O. Smrž. 2007. Functional Arabic Morphology: Formal System and Implementation. Ph.D. thesis, Charles University, Prague.
  • Snyder and Barzilay (2008) B. Snyder and R. Barzilay. 2008. Unsupervised Multilingual Learning for Morphological Segmentation. In Proceedings of ACL-HLT, pages 737–745.
  • Stolcke (2002) A. Stolcke. 2002. SRILM – An extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 901–904.
  • Stolcke and Omohundro (1994) A. Stolcke and S. M. Omohundro. 1994. Inducing probabilistic grammars by Bayesian model merging. In Grammatical inference and applications, pages 106–118. Springer.
  • Stump (2001) G. Stump. 2001. Inflectional Morphology. Cambridge studies in linguistics. Cambridge University Press, Cambridge, UK.
  • Stymne (2009) S. Stymne. 2009. A comparison of merging strategies for translation of German compounds. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 61–69. Association for Computational Linguistics, April.
  • Teh (2006a) Y. W. Teh. 2006a. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992. Association for Computational Linguistics.
  • Teh (2006b) Y. W. Teh. 2006b. A Bayesian Interpretation of Interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.
  • Turney and Pantel (2010) P. D. Turney and P. Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141–188.
  • van der Maaten and Hinton (2008) L. van der Maaten and G. Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
  • Vaswani et al. (2013) A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang. 2013. Decoding with Large-Scale Neural Language Models Improves Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, USA. Association for Computational Linguistics.
  • Vergyri et al. (2004) D. Vergyri, K. Kirchhoff, K. Duh, and A. Stolcke. 2004. Morphology-based Language Modeling for Arabic Speech Recognition. In International Conference on Spoken Language Processing, volume 6.
  • Vijay-Shanker et al. (1987) K. Vijay-Shanker, D. J. Weir, and A. K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics, pages 104–111.
  • Virpioja et al. (2007) S. Virpioja, J. J. Väyrynen, M. Creutz, and M. Sadeniemi. 2007. Morphology-Aware Statistical Machine Translation Based on Morphs Induced in an Unsupervised Manner. In Proceedings of the Machine Translation Summit XI, pages 491–498, September.
  • Wang et al. (2013) R. Wang, M. Utiyama, I. Goto, E. Sumita, H. Zhao, and B.-L. Lu. 2013. Converting Continuous-Space Language Models into N-gram Language Models for Statistical Machine Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 845–850.
  • Wasserman (2007) L. Wasserman. 2007. All of Nonparametric Statistics. Springer Texts in Statistics.
  • Williams (1981) E. Williams. 1981. On the Notions “Lexically Related” and “Head of a Word”. Linguistic Inquiry, 12(2):245–274.
  • Wood and Teh (2009) F. Wood and Y. W. Teh. 2009. A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 607–614, Clearwater Beach, Florida, USA.
  • Wu et al. (2012) Y. Wu, X. Lu, H. Yamamoto, S. Matsuda, C. Hori, and H. Kashioka. 2012. Factored Language Model based on Recurrent Neural Network. In Proceedings of COLING, pages 2835–2850.
  • Xiao and Guo (2013) M. Xiao and Y. Guo. 2013. Domain Adaptation for Sequence Labeling Tasks with a Probabilistic Language Adaptation Model. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia.
  • Xu and Rudnicky (2000) W. Xu and A. I. Rudnicky. 2000. Can Artificial Neural Networks Learn Language Models? In Proceedings of the International Conference on Statistical Language Processing.
  • Zesch and Gurevych (2006) T. Zesch and I. Gurevych. 2006. Automatically creating datasets for measures of semantic relatedness. In Proceedings of the Workshop on Linguistic Distances, pages 16–24. Association for Computational Linguistics.
  • Zipf (1932) G. Zipf. 1932. Selective Studies and the Principle of Relative Frequency in Language. Harvard University Press, Cambridge, MA, USA.
  • Zou et al. (2013) W. Y. Zou, R. Socher, D. Cer, and C. D. Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398, Seattle, USA. Association for Computational Linguistics.