
Embedded-State Latent Conditional Random Fields for Sequence Labeling
Complex textual information extraction tasks are often posed as sequence labeling or shallow parsing, where fields are extracted using local labels made consistent through probabilistic inference in a graphical model with constrained transitions. Recently, it has become common to locally parametrize these models using rich features extracted by recurrent neural networks (such as LSTMs), while enforcing consistent outputs through a simple linear-chain model, representing Markovian dependencies between successive labels. However, the simple graphical model structure belies the often complex non-local constraints between output labels. For example, many fields, such as a first name, can only occur a fixed number of times, or in the presence of other fields. While RNNs have provided increasingly powerful context-aware local features for sequence tagging, they have yet to be integrated with a global graphical model of similar expressivity in the output distribution. Our model goes beyond the linear-chain CRF to incorporate multiple hidden states per output label, but parametrizes their transitions parsimoniously with low-rank log-potential scoring matrices, effectively learning an embedding space for hidden states. This augmented latent space of inference variables complements the rich feature representation of the RNN, and allows exact global inference obeying complex, learned non-local output constraints. We experiment with several datasets and show that the model outperforms baseline CRF+RNN models when global output constraints are necessary at inference time, and explore the interpretable latent structure.
09/28/2018 ∙ by Dung Thai, et al.

Training for Fast Sequential Prediction Using Dynamic Feature Selection
We present paired learning and inference algorithms for significantly reducing computation and increasing speed of the vector dot products in the classifiers that are at the heart of many NLP components. This is accomplished by partitioning the features into a sequence of templates which are ordered such that high confidence can often be reached using only a small fraction of all features. Parameter estimation is arranged to maximize accuracy and early confidence in this sequence. We present experiments in left-to-right part-of-speech tagging on WSJ, demonstrating that we can preserve accuracy above 97%.
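The staged scoring idea can be sketched roughly as follows. This is an illustrative NumPy toy, not the authors' implementation; the template ordering, the margin threshold, and the array shapes are all assumptions made for the example:

```python
import numpy as np

def staged_score(template_feats, template_weights, margin=1.0):
    """Accumulate per-template dot products, stopping early once the
    best label beats the runner-up by `margin` (illustrative threshold).

    template_feats:   list of (d_i,) feature vectors, one per template,
                      ordered from most to least informative.
    template_weights: list of (d_i, n_labels) weight matrices.
    """
    scores = np.zeros(template_weights[0].shape[1])
    for feats, W in zip(template_feats, template_weights):
        scores += feats @ W  # add this template's contribution
        top2 = np.sort(scores)[-2:]  # [runner-up, best]
        if top2[1] - top2[0] >= margin:
            break  # confident: skip the remaining templates
    return int(np.argmax(scores))
```

Because scoring stops at the first confident prefix of templates, easy decisions pay for only a fraction of the full dot product.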
10/30/2014 ∙ by Emma Strubell, et al.

Adding Gradient Noise Improves Learning for Very Deep Networks
Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than more basic architectures. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we discuss a low-overhead and easy-to-implement technique of adding gradient noise which we find to be surprisingly effective when training these very deep architectures. The technique not only helps to avoid overfitting, but also can result in lower training loss. This method alone allows a fully-connected 20-layer deep network to be trained with standard gradient descent, even starting from a poor initialization. We see consistent improvements for many complex models, including a 72% relative reduction in error rate over a carefully-tuned baseline on a challenging question-answering task, and a doubling of the number of accurate binary multiplication models learned across 7,000 random restarts. We encourage further application of this technique to additional complex modern architectures.
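A minimal sketch of the technique, assuming the annealed Gaussian schedule sigma_t^2 = eta / (1 + t)^gamma described in the paper, with eta and gamma as tunable constants; the NumPy framing and default values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gradient_noise(grad, step, eta=0.3, gamma=0.55):
    """Add annealed Gaussian noise N(0, sigma_t^2) to a gradient, with
    sigma_t^2 = eta / (1 + step)**gamma, so the perturbation shrinks
    as training progresses."""
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)
    return grad + rng.normal(0.0, sigma, size=grad.shape)
```

The noise variance decays with the step count, so early updates are perturbed strongly (helping escape poor initializations) while late updates are nearly noise-free.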
11/21/2015 ∙ by Arvind Neelakantan, et al.

Unsupervised Hypernym Detection by Distributional Inclusion Vector Embedding
Modeling hypernymy, such as poodle is-a dog, is an important generalization aid to many NLP tasks, such as entailment, relation extraction, and question answering. Supervised learning from labeled hypernym sources, such as WordNet, limits the coverage of these models, which can be addressed by learning hypernyms from unlabeled text. Existing unsupervised methods either do not scale to large vocabularies or yield unacceptably poor accuracy. This paper introduces distributional inclusion vector embedding (DIVE), a simple-to-implement unsupervised method of hypernym discovery via per-word non-negative vector embeddings learned by modeling diversity of word context with specialized negative sampling. In an experimental evaluation more comprehensive than any previous literature of which we are aware (evaluating on 11 datasets using multiple existing as well as newly proposed scoring metrics), we find that our method can provide up to double or triple the precision of previous unsupervised methods, and also sometimes outperforms previous semi-supervised methods, yielding many new state-of-the-art results.
10/02/2017 ∙ by Haw-Shiuan Chang, et al.

Low-Rank Hidden State Embeddings for Viterbi Sequence Labeling
In textual information extraction and other sequence labeling tasks it is now common to use recurrent neural networks (such as LSTMs) to form rich embedded representations of long-term input co-occurrence patterns. Representation of output co-occurrence patterns is typically limited to a hand-designed graphical model, such as a linear-chain CRF representing short-term Markov dependencies among successive labels. This paper presents a method that learns embedded representations of latent output structure in sequence data. Our model takes the form of a finite-state machine with a large number of latent states per label (a latent variable CRF), where the state-transition matrix is factorized, effectively forming an embedded representation of state transitions capable of enforcing long-term label dependencies, while supporting exact Viterbi inference over output labels. We demonstrate accuracy improvements and interpretable latent structure in a synthetic but complex task based on CoNLL named entity recognition.
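A minimal sketch of Viterbi decoding with a factorized transition matrix, assuming the transition log-potentials decompose as T = U V^T over r-dimensional state embeddings. The shapes and the omission of start/stop potentials and emission networks are simplifications for illustration, not the paper's full model:

```python
import numpy as np

def viterbi_lowrank(emissions, U, V):
    """Viterbi decoding where the S x S transition log-potentials are the
    low-rank product T = U @ V.T (U, V: S x r state embeddings).

    emissions: (n, S) per-position state log-scores.
    Returns the highest-scoring state sequence as a list of indices."""
    T = U @ V.T                        # (S, S) factorized transition scores
    n, S = emissions.shape
    delta = emissions[0].copy()        # best score of paths ending in each state
    back = np.zeros((n, S), dtype=int)
    for t in range(1, n):
        cand = delta[:, None] + T      # cand[i, j]: best path into i, then i -> j
        back[t] = np.argmax(cand, axis=0)
        delta = cand.max(axis=0) + emissions[t]
    states = [int(np.argmax(delta))]
    for t in range(n - 1, 0, -1):      # follow backpointers in reverse
        states.append(int(back[t, states[-1]]))
    return states[::-1]
```

Storing U and V instead of a dense T means the transition parameters grow as O(Sr) rather than O(S^2), which is what makes a large number of latent states per label affordable.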
08/02/2017 ∙ by Dung Thai, et al.

Improved Representation Learning for Predicting Commonsense Ontologies
Recent work in learning ontologies (hierarchical and partially-ordered structures) has leveraged the intrinsic geometry of spaces of learned representations to make predictions that automatically obey complex structural constraints. We explore two extensions of one such model, the order-embedding model for hierarchical relation learning, with an aim towards improved performance on text data for commonsense knowledge representation. Our first model jointly learns ordering relations and non-hierarchical knowledge in the form of raw text. Our second extension exploits the partial order structure of the training data to find long-distance triplet constraints among embeddings which are poorly enforced by the pairwise training procedure. We find that both incorporating free text and adding the augmented training constraints improve over the original order-embedding model and other strong baselines.
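For intuition, a common formulation of the order-embedding violation penalty for a pair (x, y) is ||max(0, y - x)||^2, which is zero exactly when y <= x coordinate-wise, i.e. when x sits below the more general concept y under the reversed product order on non-negative vectors. The sketch below illustrates that penalty; it is not the paper's training objective:

```python
import numpy as np

def order_violation(x, y):
    """Order-embedding penalty ||max(0, y - x)||^2: zero exactly when
    y <= x coordinate-wise, so "x is-a y" holds in the reversed
    product order; positive otherwise, and asymmetric in (x, y)."""
    return float(np.sum(np.maximum(0.0, y - x) ** 2))
```

The asymmetry is the point: unlike a distance, the penalty distinguishes "x is-a y" from "y is-a x", which is what lets the geometry encode a partial order.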
08/01/2017 ∙ by Xiang Li, et al.

Generating Sentences from a Continuous Space
The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.
11/19/2015 ∙ by Samuel R. Bowman, et al.

Bethe Projections for Non-Local Inference
Many inference problems in structured prediction are naturally solved by augmenting a tractable dependency structure with complex, non-local auxiliary objectives. This includes the mean field family of variational inference algorithms, soft- or hard-constrained inference using Lagrangian relaxation or linear programming, collective graphical models, and forms of semi-supervised learning such as posterior regularization. We present a method to discriminatively learn broad families of inference objectives, capturing powerful non-local statistics of the latent variables, while maintaining tractable and provably fast inference using non-Euclidean projected gradient descent with a distance-generating function given by the Bethe entropy. We demonstrate the performance and flexibility of our method by (1) extracting structured citations from research papers by learning soft global constraints, (2) achieving state-of-the-art results on a widely-used handwriting recognition task using a novel learned non-convex inference procedure, and (3) providing a fast and highly scalable algorithm for the challenging problem of inference in a collective graphical model applied to bird migration.
03/04/2015 ∙ by Luke Vilnis, et al.

Word Representations via Gaussian Embedding
Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.
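One concrete payoff of density embeddings is an asymmetric similarity. The sketch below uses the closed-form KL divergence between diagonal Gaussians; the choice of KL as the score (and its direction) is illustrative of the asymmetry, not a claim about the paper's exact energy function:

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL(N1 || N2) for diagonal Gaussians, in closed form:
    0.5 * sum( var1/var2 + (mu2 - mu1)^2/var2 - 1 + log(var2/var1) ).
    Asymmetric in its arguments, so it can score directed relations
    such as entailment, unlike dot-product or cosine similarity."""
    return 0.5 * float(np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2
                              - 1.0 + np.log(var2 / var1)))
```

A broad-variance distribution can "contain" a narrow one cheaply in one KL direction but not the other, which is the kind of asymmetry a point-vector dot product cannot express.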
12/20/2014 ∙ by Luke Vilnis, et al.

Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning
Knowledge bases (KB), both automatically and manually constructed, are often incomplete: many valid facts can be inferred from the KB by synthesizing existing information. A popular approach to KB completion is to infer new relations by combinatory reasoning over the information found along other paths connecting a pair of entities. Given the enormous size of KBs and the exponential number of paths, previous path-based models have considered only the problem of predicting a missing relation given two entities or evaluating the truth of a proposed triple. Additionally, these methods have traditionally used random paths between fixed entity pairs or more recently learned to pick paths between them. We propose a new algorithm MINERVA, which addresses the much more difficult and practical task of answering questions where the relation is known, but only one entity. Since random walks are impractical in a setting with combinatorially many destinations from a start node, we present a neural reinforcement learning approach which learns how to navigate the graph conditioned on the input query to find predictive paths. Empirically, this approach obtains state-of-the-art results on several datasets, significantly outperforming prior methods.
11/15/2017 ∙ by Rajarshi Das, et al.

Finer Grained Entity Typing with TypeNet
We consider the challenging problem of entity typing over an extremely fine-grained set of types, wherein a single mention or entity can have many simultaneous and often hierarchically-structured types. Despite the importance of the problem, there is a relative lack of resources in the form of fine-grained, deep type hierarchies aligned to existing knowledge bases. In response, we introduce TypeNet, a dataset of entity types consisting of over 1941 types organized in a hierarchy, obtained by manually annotating a mapping from 1081 Freebase types to WordNet. We also experiment with several models comparable to state-of-the-art systems and explore techniques to incorporate a structure loss on the hierarchy with the standard mention typing loss, as a first step towards future research on this dataset.
11/15/2017 ∙ by Shikhar Murty, et al.