Metric Learning for Dynamic Text Classification

11/04/2019 ∙ by Jeremy Wohlwend, et al. ∙ 0

Traditional text classifiers are limited to predicting over a fixed set of labels. However, in many real-world applications the label set is frequently changing. For example, in intent classification, new intents may be added over time while others are removed. We propose to address the problem of dynamic text classification by replacing the traditional, fixed-size output layer with a learned, semantically meaningful metric space. Here the distances between textual inputs are optimized to perform nearest-neighbor classification across overlapping label sets. Changing the label set does not involve removing parameters, but rather simply adding or removing support points in the metric space. Then the learned metric can be fine-tuned with only a few additional training examples. We demonstrate that this simple strategy is robust to changes in the label space. Furthermore, our results show that learning a non-Euclidean metric can improve performance in the low data regime, suggesting that further work on metric spaces may benefit low-resource research.







1 Introduction

Text classification often assumes a static set of labels. While this assumption holds for tasks such as sentiment analysis and part-of-speech tagging Pang and Lee (2005); Kim (2014); Brants (2000); Collins (2002); Toutanova et al. (2003), it is rarely true for real-world applications. Consider the example of news categorization in Figure 1 (a). A domain expert may decide that the Sports class should be separated into two distinct Soccer and Baseball sub-classes, and conversely merge the two Cars and Motorcycles classes into a single Auto category. Another example is user intent classification in task-oriented dialog systems. In Figure 1 (b) for example, an intent to redeem a reward can be removed when the option is no longer available, while a new intent to apply free shipping can be added to the system.

In all of these applications, the classifier must remain applicable for dynamic classification, a task where the label set is rapidly evolving.

Figure 1: Examples of dynamic classification. In the hierarchical setting (left), new labels are created by splitting and merging old labels. In the flat setting (right), arbitrary labels can be added or removed.

Several factors make the dynamic classification problem difficult. First, traditional classifiers are not suited to changes in the label space. These classifiers produce a fixed-size output whose dimensions each align to an existing label. Thus, adding or removing any label requires changing the model architecture. Second, while it is possible to retain some model parameters, such as in hierarchical classification models, these architectures must still learn separate weights for every new class or sub-class Cai and Hofmann (2004); Kowsari et al. (2017). This is problematic because the new class labels often come with very few training examples, providing insufficient information for learning accurate model weights. Furthermore, these models do not leverage information across similar labels, which weakens their ability to adapt to new target labels Kowsari et al. (2017); Tsochantaridis et al. (2005); Cai and Hofmann (2004).

We propose to address these issues by learning an embedding function which maps input text into a semantically meaningful metric space. The parameterized metric space, once trained on an initial set of labeled data, can be used to perform classification in a nearest-neighbor fashion (by comparing the distance from the input text to reference texts with known labels). As a result, the classifier becomes agnostic to changes in the label set. One remaining design challenge, however, is to learn a representation that best leverages the relationship between old and new labels. In particular, the label split example in Figure 1 (a) shows that new labels are often formed by partitioning an old label. This suggests that the classifier may benefit from a metric space that can better represent the structural relationships between labels. Given the hierarchical relationship between the old and new labels, we choose a space of negative curvature (hyperbolic), which has been shown to better embed tree-like structure Nickel and Kiela (2017); Sala et al. (2018); Gu et al. (2019).

Our two main contributions are outlined below:

  1. We design an experimental framework for dynamic text classification, and propose a classification strategy based on prototypical networks, a simple but powerful metric learning technique Snell et al. (2017).

  2. We construct a novel prototypical network adapted to hyperbolic geometry. This requires deriving useful prototypes to represent a set of points on the negatively curved Riemannian manifold. We state sufficient theoretical conditions for the resulting optimization problem to converge. To the best of our knowledge, this is the first application of hyperbolic geometry to text classification beyond the word level.

We perform a thorough experimental analysis by considering the model improvements across several aspects: low-resource fine-tuning, impact of pretraining, and ability to learn new classes. We find that the metric learning approach adapts more gracefully to changes in the label distribution, and outperforms traditional, fixed-size classifiers in every aspect of the analysis. Furthermore, our proposed hyperbolic prototypical network outperforms its Euclidean counterpart in the low-resource setting, when fewer than 10 examples per class are available.

2 Related Work

Prototypical Networks and Manifold Learning:

This paper builds on the prototypical network architecture Snell et al. (2017), which was originally proposed in the context of few-shot learning. In both their work and ours, the goal is to embed training data in a space such that the distance to prototype centroids of points with the same label define good decision boundaries for labeling test data with a nearest neighbor classifier. Building on earlier work in metric learning Vinyals et al. (2016); Ravi and Larochelle (2017), the authors show that learned prototype points also help the network classify inputs into test classes for which minimal data exists.

This architecture has found success in computer vision applications such as image and video classification Weinberger and Saul (2009); Ustinova and Lempitsky (2016); Luo et al. (2017). Very recently, prototypical network architectures have shown promising results on relational classification tasks Han et al. (2018); Gao et al. (2019). To the best of our knowledge, our work is the first application of prototypical network architectures to text classification using non-Euclidean geometry.[2]

[2] Snell et al. (2017) discuss their formulation in the context of Euclidean distance, cosine distance (spherical manifold), and general Bregman divergences; however, classical Bregman divergence does not easily generalize to hyperbolic space (Section 3.3).

Concurrent with the writing of this paper, Khrulkov et al. (2019) applied several hyperbolic neural networks to few-shot image classification tasks. However, their prototypical network uses the Einstein midpoint rather than the definition of the mean we use in Section 3.3.

In Chen et al. (2019) the authors embed the labels and data separately, then predict hierarchical class membership using an interaction model. Our model directly links embedding distances to model predictions, and thus learns an embedded space that is more amenable to low-resource, dynamic classification tasks.

Hyperbolic geometry has been explored in classical works of differential geometry Thurston (2002); Cannon et al. (1997); Berger (2003). More recently, hyperbolic space has been studied in the context of developing neural networks with hyperbolic parameters Ganea et al. (2018b).

In particular, recent work has successfully applied hyperbolic geometry to graph embeddings Sarkar (2011); Nickel and Kiela (2017, 2018); Sala et al. (2018); Ganea et al. (2018a); Gu et al. (2019). In all of these prior works, the model's parameters correspond to node vectors in hyperbolic space that require Riemannian optimization. In our case, only the model's outputs live in hyperbolic space, not its parameters, which avoids propagating gradients in hyperbolic space and facilitates optimization. This is explained in more detail in Section 3.4.


Hierarchical or Few-shot Text Classification:

Many classical models for multi-class classification incorporate a hierarchical label structure Tsochantaridis et al. (2005); Cai and Hofmann (2004); Yen et al. (2016); Naik et al. (2013); Sinha et al. (2018). Most models proceed in a top-down manner: a separate classifier (logistic regression, SVM, etc.) is trained to predict the correct child label at each node in the label hierarchy.

For instance, HDLTex Kowsari et al. (2017) addresses large hierarchical label sets explicitly by training a stacked, hierarchical neural network architecture. Such approaches do not scale well to deep and large label hierarchies, while our method can adapt to more flexible settings, such as adding or removing labels, without adding extra parameters.

Our work also relates to text classification in a low-resource setting. While a wide range of methods improve accuracy by leveraging external data such as multi-task training Miyato et al. (2016); Chen et al. (2018); Yu et al. (2018); Guo et al. (2018), semi-supervised pretraining Dai and Le (2015), and unsupervised pretraining Peters et al. (2018); Devlin et al. (2018), our method makes use of the structure of the data via metric learning. As a result, our method can be easily combined with any of these methods to further improve model performance.

3 Model Framework

This section provides the details of each component of our framework, starting with a more detailed formulation of dynamic classification. We then provide some background on prototypical networks, before introducing our hyperbolic variant and its theoretical guarantees.

3.1 Dynamic Classification

Formally, we formulate dynamic classification as the following problem: given access to a large, labeled training corpus over an old label set, we are interested in training a classifier over a new label set using only a few labeled examples per new label. Unlike few-shot learning, the old and new label sets need not be disjoint.

We consider two different cases: 1) new labels arrive together with newly collected input data, and 2) during label splitting/merging, some new examples may be constructed by relabeling old examples from an old label to a new one. This latter case is of particular interest, as the classifier may be able to leverage its knowledge of old labels in learning to classify new ones.

There are many natural approaches to this problem. First, a fixed model trained on the old data may be applied directly to classify examples under the new label set, which we refer to as an un-tuned model. Alternately, a pretrained model may be fine-tuned on the new examples. Finally, it is also possible to train from scratch on the new data alone, disregarding the model weights trained on the old data distribution. We compare these strategies in Sections 4 and 5.

3.2 Episodic Training

The standard prototypical network is trained using episodic training, as described in Snell et al. (2017). We view our model as an embedding function which takes textual inputs and outputs points in the metric space.

Let $d(\cdot, \cdot)$ denote the distance between two points in our metric space, and let $f_\theta$ denote our embedding function. At each iteration, we form a new episode by sampling a set of target labels, as well as support and query points for each of the sampled labels. Let $N_c$, $N_s$, and $N_q$ be the number of classes tested, the number of support points used, and the number of query points used in each episode, respectively.

For each episode, we first sample $N_c$ classes uniformly across all training labels. We then build a support set $S_k$ for each selected class $k$ by sampling $N_s$ training examples from that class. For each support set, we compute a prototype vector $p_k$. For the standard Euclidean prototypical network, we use the mean of the embedded support set:

$$p_k = \frac{1}{|S_k|} \sum_{x \in S_k} f_\theta(x).$$

To compute the loss for an episode, we further sample $N_q$ query points per selected class which do not appear in the support set of the episode. We then encode each query sequence and apply a softmax function over the negative distances from the query point to the episode's class prototypes. This yields a probability distribution over classes, and we take the negative log probability of the true class, averaged over the query points, to get the loss for the episode:

$$\mathcal{L} = -\frac{1}{N_c N_q} \sum_{k=1}^{N_c} \sum_{q \in Q_k} \log \frac{\exp(-d(f_\theta(q), p_k))}{\sum_{k'} \exp(-d(f_\theta(q), p_{k'}))},$$

where $k'$ in the denominator ranges from $1$ to $N_c$. The steps of a single episode are summarized in Algorithm 1.

Once episodic training is finished, the prototype vector for a class can be computed as the mean of the embeddings of any number of items in that class. In our experiments, we use the whole training set to compute the final class prototypes, but when resources are limited, fewer support points could also be used.

Input: D – set of (x, y) pairs

D_k – all pairs (x, y) with y = k

N_c – number of classes sampled each episode

N_s – number of support points

N_q – number of query points

 1: procedure Episode(D, N_c, N_s, N_q)
 2:     V ← N_c classes sampled uniformly without replacement
 3:     for k ∈ V do
 4:         S_k ← N_s support points sampled from D_k
 5:         Q_k ← N_q query points sampled from D_k \ S_k
 6:         p_k ← prototype of the embedded support set {f_θ(x) : x ∈ S_k}
 7:     Q ← Concat(Q_{k_1}; Q_{k_2}; …; Q_{k_{N_c}})
 8:     L ← 0
 9:     for each (x, y) ∈ Q do
10:         L ← L − log p(y | x) / |Q|    ▷ softmax over negative distances to prototypes
11:     return L
Algorithm 1 Prototypical Training Episode
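As an illustration of the episode above, the following NumPy sketch computes the loss of a single Euclidean episode. The encoder `embed`, the data layout, and all names here are our own illustrative assumptions, not the authors' implementation (which uses a recurrent SRU encoder and backpropagates through this loss).

```python
import numpy as np

def episode_loss(embed, data_by_class, n_c, n_s, n_q, rng):
    """One prototypical training episode with the Euclidean metric.

    embed: function mapping an array of inputs to an array of embeddings.
    data_by_class: dict mapping each label to an array of inputs for that label.
    """
    classes = rng.choice(sorted(data_by_class), size=n_c, replace=False)
    protos, queries = [], []
    for k in classes:
        idx = rng.permutation(len(data_by_class[k]))
        support = data_by_class[k][idx[:n_s]]          # support points
        query = data_by_class[k][idx[n_s:n_s + n_q]]   # disjoint query points
        protos.append(embed(support).mean(axis=0))     # prototype = mean embedding
        queries.append(embed(query))
    protos = np.stack(protos)                          # (n_c, dim)
    loss, n = 0.0, 0
    for true_k, q in enumerate(queries):
        d2 = ((q[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # squared distances
        log_p = -d2 - np.log(np.exp(-d2).sum(axis=1, keepdims=True))  # log softmax(-d2)
        loss -= log_p[:, true_k].sum()
        n += len(q)
    return loss / n
```

With well-separated classes, the loss should be near zero; with overlapping classes it approaches the log of the number of classes.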

3.3 Hyperbolic Prototypical Networks

In this section we discuss the hyperbolic prototypical network, which can better model structural relationships between labels. We first review the hyperboloid model of hyperbolic space and its distance formula. Then we describe the main technical challenge of computing good prototypes in hyperbolic space. Proofs of uniqueness and convergence can be found in Appendix A. We also describe a second, distinct method for computing prototypes which is used to initialize our main method during experiments. A detailed discussion of this point is provided in Appendix B.

Hyperbolic space can be interpreted as a continuous analogue of a tree Cannon et al. (1997); Krioukov et al. (2010). While embedding a tree in Euclidean space with minimal distortion requires the number of dimensions to grow with the number of vertices, hyperbolic space needs only 2 dimensions. Additionally, the circumference of a hyperbolic disk grows exponentially with its radius. Therefore, hyperbolic models have room to place many prototypes equidistant from a common parent while maintaining separability from other classes. We argue that this property helps text classification with latent hierarchical structures (e.g. dynamic label splitting).

The reader is referred to Section 2.6 of Thurston (2002) for a detailed introduction to hyperbolic geometry, and to Cannon et al. (1997) for a more gentle introduction. In this section we have adopted the sign convention of Sala et al. (2018).

Hyperbolic space in $d$ dimensions is the unique, simply connected, $d$-dimensional Riemannian manifold with constant curvature $-1$. The hyperboloid (or Lorentz) model realizes $d$-dimensional hyperbolic space as an isometric embedding inside $\mathbb{R}^{d+1}$ endowed with a signature $(1, d)$ bilinear form.

Specifically, let the coordinates of any $x \in \mathbb{R}^{d+1}$ be $(x_0, x_1, \ldots, x_d)$. Then we can define a bilinear form on $\mathbb{R}^{d+1}$ by

$$\langle x, y \rangle = -x_0 y_0 + \sum_{i=1}^{d} x_i y_i,$$

which allows us to define the hyperboloid to be the set $\{x : \langle x, x \rangle = -1,\ x_0 > 0\}$. We induce a Riemannian metric on the hyperboloid by restricting $\langle \cdot, \cdot \rangle$ to the hyperboloid's tangent spaces. The resulting Riemannian manifold is hyperbolic space $\mathbb{H}^d$. For $x, y \in \mathbb{H}^d$, the hyperbolic distance is given by

$$d(x, y) = \operatorname{arccosh}(-\langle x, y \rangle).$$

There are several equivalent ways of defining hyperbolic space. We choose to work primarily in the hyperboloid model over other models (e.g. the Poincaré disk model) for improved numerical stability. We use the $d$-dimensional output vector $v$ of our network and project it onto the hyperboloid embedded in $d+1$ dimensions:

$$\phi(v) = \left(\sqrt{1 + \lVert v \rVert^2},\ v\right).$$
A key algorithmic difference between the Euclidean and the hyperbolic model is the computation of prototype vectors. There are multiple definitions that generalize the notion of a mean to general Riemannian manifolds.

One sensible mean $\mu$ of a set $S \subset \mathbb{H}^d$ is given by the point which minimizes the sum of squared distances to each point in $S$:

$$\mu = \operatorname*{arg\,min}_{p \in \mathbb{H}^d} \sum_{x \in S} d(p, x)^2. \qquad (5)$$

A proof for the following proposition can be found in Appendix A. We note that, concurrent with the writing of this paper, a generalized version of our result appeared in Gu et al. (2019) as Lemma 2.

Proposition 1.

Every finite collection of points in $\mathbb{H}^d$ has a unique mean $\mu$. Furthermore, solving the optimization problem (5) with Riemannian gradient descent will converge to $\mu$.

In an effort to derive a closed form for the mean (rather than solve a Riemannian optimization problem), we found the following expression to be a good approximation. It is computed by averaging the vectors in $S$ and scaling the average by the constant which projects it back onto the hyperboloid:

$$\tilde{\mu} = \frac{\bar{x}}{\sqrt{-\langle \bar{x}, \bar{x} \rangle}}, \qquad \bar{x} = \frac{1}{|S|} \sum_{x \in S} x.$$

$\mu$ and $\tilde{\mu}$ can be shown to differ through a simple counterexample, although in practice we find little difference between their values during experiments. The proof is provided in Appendix B.

3.4 Implementation and Stability

Our final hyperbolic prototypical model combines both definitions with the following heuristic: initialize problem (5) with the approximate mean $\tilde{\mu}$, and then run several iterations of Riemannian gradient descent. We find that it is possible to backpropagate through a few steps of the gradient descent procedure described above during prototypical model training. However, we also find that the model can be trained successfully when detaching the gradients with respect to the support points. This suggests that prototypical models can be trained in metric spaces where the mean or its gradient cannot be computed efficiently. Further experimental details are provided in the next section.
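To make the heuristic concrete, here is a NumPy sketch, our own illustrative assumption rather than the released code, that initializes at the approximate mean and refines it with a few Riemannian gradient descent (Karcher flow) steps; we use a damped step size of 0.5 for stability, and repeat the Lorentz operations so the snippet is self-contained.

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def approx_mean(X):
    """Ambient average of the points, rescaled back onto the hyperboloid."""
    m = X.mean(axis=0)
    return m / np.sqrt(-lorentz_inner(m, m))

def log_map(p, x):
    """Riemannian log map at p: tangent vector pointing from p toward x."""
    u = x + lorentz_inner(p, x)[..., None] * p            # tangent component
    d = np.arccosh(np.clip(-lorentz_inner(p, x), 1.0, None))
    norm = np.sqrt(np.clip(lorentz_inner(u, u), 1e-12, None))
    return (d / norm)[..., None] * u

def exp_map(p, v):
    """Riemannian exp map at p for a tangent vector v."""
    n = np.sqrt(np.clip(lorentz_inner(v, v), 1e-12, None))
    return np.cosh(n) * p + np.sinh(n) * v / n

def hyperbolic_mean(X, iters=5, step=0.5):
    """Initialize at the approximate mean, then refine by gradient steps."""
    p = approx_mean(X)
    for _ in range(iters):
        p = exp_map(p, step * log_map(p, X).mean(axis=0))
    return p
```

Because each update is an exponential map, the iterate stays exactly on the hyperboloid; for a point set that is symmetric about the "north pole" $(1, 0, \ldots, 0)$, both the initializer and the refinement leave that pole fixed.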

Our prototypical network loss function uses the squared Euclidean distance and the squared hyperbolic distance for similar reasons. Namely, the distance between two close points is much less numerically stable than the squared distance. In the Euclidean case, the derivative of $\lVert x \rVert$ is undefined at $x = 0$. In the hyperbolic case, the derivative of $\operatorname{arccosh}(x)$ is undefined at $x = 1$, and $-\langle x, y \rangle \to 1$ as $y \to x$ for points on the hyperboloid. If we instead use the squared hyperbolic distance, L'Hôpital's rule implies that the derivative of $\operatorname{arccosh}^2(x)$ tends to $2$ as $x \to 1^+$, allowing gradients to backpropagate through the squared hyperbolic distance without issue.

4 Experiments

We evaluate the performance of our framework on several text classification benchmarks, two of which exhibit a hierarchical label set. We only use the label hierarchy to simulate the label splitting discussed in Figure 1 (a). The models are not trained with explicit knowledge of the hierarchy, as we assume that the full hierarchy is not known a priori in the dynamic classification setting. A description of the datasets is provided below:

  • 20 Newsgroups (NEWS): This dataset is composed of nearly 20,000 documents, distributed across 20 news categories. We use the provided label hierarchy to form the label tree used throughout our experiments. We use 9,044 documents for training, 2,668 for validation, and 7,531 for testing.

  • Web of Science (WOS): This dataset was used in two previous works on hierarchical text classification Kowsari et al. (2017); Sinha et al. (2018). It contains 134 topics, split across 7 parent categories. It contains 46,985 documents collected from the Web of Science citation index. We use 25,182 documents for training, 6,295 for validation, and 15,503 for testing.

  • Twitter Airline Sentiment (SENT): This dataset consists of public tweets from customers to American-based airlines, labeled with one of several reasons for negative sentiment (e.g. Late Flight, Lost Luggage). We preprocess the data by keeping only the negative tweets whose reason label has high annotator confidence. This dataset is non-hierarchical and composed of nearly 7,500 documents. We use 5,975 documents for training, 742 for validation, and 754 for testing.

Table 1: Test accuracy for each dataset and method. Columns indicate the number of examples per label used in the fine tuning stage. In all cases, the prototypical models outperform the baseline. The hyperbolic model performs best in the low data regime, but both metrics perform comparably when data is abundant.

Dynamic Setup:

We construct training data for the task of dynamic classification as follows. First, we split our training data in half. The first half is used for pretraining and the second for fine-tuning. To simulate a change in the label space, we randomly remove a fraction of the labels in the pretraining data. This procedure yields two label sets, with the pretraining label set contained in the fine-tuning label set. In our experiments, we further vary the amount of data available in the fine-tuning set. For the flat dataset, the labels to be removed are sampled uniformly. In the hierarchical case, we create the pretraining label set by randomly collapsing leaf labels into their parent classes, as shown previously in Figure 1.
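The hierarchical variant of this setup can be sketched as follows; the function name, data layout, and the choice to sample collapsed leaves independently are our illustrative assumptions, not the authors' exact procedure.

```python
import random

def collapse_labels(examples, parent, frac, seed=0):
    """Simulate label splits by collapsing a random fraction of leaf
    labels into their parents for the pretraining half of the data.

    examples: list of (text, leaf_label) pairs.
    parent: dict mapping each leaf label to its parent label.
    frac: fraction of leaf labels to collapse.
    """
    rng = random.Random(seed)
    leaves = sorted({y for _, y in examples})
    k = int(len(leaves) * frac)
    collapsed = set(rng.sample(leaves, k))
    # relabel collapsed leaves with their parent; leave the rest unchanged
    return [(x, parent[y] if y in collapsed else y) for x, y in examples]
```

Fine-tuning data keeps the original leaf labels, so the model later sees the collapsed parents split back into their children.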

Hyperparameters and Implementation Details:

We apply the same encoder architecture throughout all experiments. We use a 4-layer recurrent neural network with SRU cells Lei et al. (2018) and a hidden size of 128. We use pretrained GloVe embeddings Pennington et al. (2014), which are fixed during training. A sequence-level embedding is computed by passing a sequence of word embeddings through the recurrent encoder and taking the embedding of the last token to represent the sequence. We use the ADAM optimizer with the default learning rate of 0.001, and train for 100 epochs for the baseline models and 10,000 episodes for the prototypical models, with early stopping. We use the full label set every episode for all datasets except WOS, for which we sample a subset of the classes each episode. We use a dropout rate of 0.5 on NEWS and SENT, and 0.3 for the larger WOS dataset. We tuned the learning rate and dropout for each model on a held-out validation set.

For the hyperbolic prototypical network, we follow the initialization and update procedure outlined at the end of Section 3.3, running a small number of Riemannian gradient descent iterations during training and during evaluation.

We utilize negative squared distance in the softmax computation in order to improve numerical stability. The means are computed via (5) during both training and model inference. However, this computation is treated as a constant during backpropagation as described in Section 3.3.


Baseline:

Our baseline model consists of the same recurrent encoder and an extra linear output layer which computes the final probabilities over the target classes. In order to fine-tune this multilayer perceptron (MLP) model on a new label ontology, we reuse the encoder and learn a new output layer. This differs from the prototypical models, for which the architecture is kept unchanged.


Evaluation:

We evaluate the performance of our models using accuracy with respect to the new label set. We also highlight accuracy on only the classes introduced during label addition/splitting. All results are averaged over multiple random label splits.


Results:

Table 1 shows the accuracy of the fine-tuned models for all three methods. The SENT dataset shows performance in the case where completely new labels are added during fine-tuning. In the NEWS and WOS datasets, new labels originate from splits of old labels.

In all cases, the prototypical models outperform the baseline MLP model significantly, especially when the data in the new label distribution is in the low-resource regime (+5–15% accuracy). We also see an increase in performance in the high data regime of up to 5%.

Table 1 further shows that the hyperbolic model outperforms its Euclidean counterpart in the low data regime on the NEWS and WOS datasets. This is consistent with our hypothesis (and previous work) that hyperbolic geometry is well suited for hierarchical data. Interestingly, the hyperbolic model also performs better on the non-hierarchical SENT dataset when given few examples, which implies that certain metric spaces may be generally stronger in the low-resource setting. In the high data regime, however, both prototypical models perform comparably.

5 Analysis

In this section, we examine several aspects of our experimental setup more closely, using the SENT and NEWS datasets for this analysis. Results on WOS can be found in Appendix E.

Benefits of Pretraining

(a) SENT
(b) NEWS
Figure 2: Accuracy gains from pretraining as a function of the number of examples per class available in the new label distribution. While the models are comparable in the pretraining stage (solid bars), the prototypical models make better use of pretraining, showing higher gains during fine-tuning in both the low and high data regimes (translucent bars).

We wish to isolate the effect of pretraining on an older label set by measuring the performance of our models on the new label distribution with and without pretraining. Figure 2 shows accuracy without pretraining as solid bars, with the gains due to pretraining shown as translucent bars above them. In the low-data regime without pretraining, all models often perform similarly. Nevertheless, our models do improve substantially over the baseline once pretraining is introduced.

With only a few new examples, our models better leverage knowledge gained from old pretraining data. On the NEWS dataset in particular, with only 5 fine-tuning examples per class, the relative reduction in classification error is far larger for the metric learning models (both Euclidean and hyperbolic) than for the baseline. This shows that the prototypical network, and particularly the hyperbolic model, can adapt more quickly to dynamic label shifts. Furthermore, the prototypical models conserve their advantage over the baseline in the high data regime, though the margins become smaller.

Model 5 10 20 100
MLP (un-tuned) 38.2 46.7 42.4 46.3
EUC 39.6 45.5 47.7 62.7
EUC (un-tuned) 43.4 51.2 47.6 55.8
HYP 42.2 47.1 53.0 62.7
HYP (un-tuned) 45.7 52.4 53.3 53.1
(a) SENT
Model 5 10 50 500
MLP (un-tuned) 29.5 34.6 40.3 42.7
EUC 56.5 65.6 74.2 79.8
EUC (un-tuned) 53.2 56.5 59.6 60.8
HYP 64.8 69.7 72.9 78.8
HYP (un-tuned) 60.1 62.9 65.4 66.7
(b) NEWS
Table 2: Test accuracy for each dataset and method. Columns indicate the number of examples per label used for fine-tuning and/or creating prototype vectors.

Benefits of Fine-tuning

An important advantage of the prototypical model is its ability to predict classes that were unseen during training with as few as a single support point for the new class. A natural question is whether fine-tuning on these new class labels immediately improves performance, or whether fine-tuning should only be done once a significant amount of data has been obtained from the new distribution. We study this question by comparing the performance of tuned and un-tuned models on the new label distribution.

Table 2 compares the accuracy of two types of pretrained prototypical models provided with a variable number of new examples. The fine-tuned model uses this data both for additional training and for constructing new prototypes. The un-tuned model constructs prototypes using the pretrained model's representations without additional training. We also construct an un-tuned MLP baseline by fitting a nearest neighbor classifier (KNN, k=5) on the encodings of the penultimate layer of the network. We experimented with fitting the KNN on the output predictions but found that using the penultimate layer was more effective.
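A minimal sketch of this un-tuned KNN baseline, assuming a frozen `encode` function that returns penultimate-layer features (the names and data layout are ours, not the released code):

```python
import numpy as np

def untuned_knn_predict(encode, train_texts, train_labels, test_texts, k=5):
    """Un-tuned baseline: k-nearest-neighbor majority vote over frozen
    penultimate-layer features from a pretrained encoder."""
    X = encode(train_texts)                      # (n_train, dim) feature matrix
    labels = np.asarray(train_labels)
    preds = []
    for q in encode(test_texts):
        # indices of the k closest training points in feature space
        nearest = np.argsort(((X - q) ** 2).sum(axis=1))[:k]
        vals, counts = np.unique(labels[nearest], return_counts=True)
        preds.append(vals[np.argmax(counts)])    # majority vote among neighbors
    return preds
```

Because no parameters are updated, the same call works for labels the encoder never saw during pretraining, which is what makes this a useful un-tuned comparison point.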

We find that the models generally benefit from fine-tuning once a significant amount of data for the new classes is provided (on the order of 20 examples per class). In the low data regime, however, the results are less consistent, suggesting that performance may be highly dataset dependent. We note, however, that all metric learning models significantly outperform the MLP-KNN baseline in both the low and high data regimes. This shows that regardless of fine-tuning, our approach is more robust on previously unseen classes.

Learning New Classes

An important factor in the dynamic classification setup is the model's ability not only to keep performing well on the old classes, but also to adapt smoothly to new ones. We highlight the performance of the models on the newly introduced labels in Figure 3, where we see that the improvement in accuracy is dominated by the performance on the new classes. Plots for additional datasets are shown in Appendix E.

(a) Accuracy with respect to the full label set
(b) Accuracy with respect to new classes only
Figure 3: Accuracy on the NEWS dataset against the number of fine-tuning examples: (a) all classes and (b) newly introduced classes only. The mean is taken over 5 random label splits, and error bars are given at one standard deviation.

6 Conclusions

We propose a framework for dynamic text classification in which the label space is considered flexible and subject to frequent changes. We apply a metric learning method, the prototypical network, and demonstrate its robustness for this task in a variety of data regimes. Motivated by the idea that new labels often originate from label splits, we extend prototypical networks to hyperbolic geometry, derive expressions for hyperbolic prototypes, and demonstrate the effectiveness of our model in the low-resource setting. Our experimental findings suggest that metric learning improves dynamic text classification models, and offer insights on how to combine low-resource training data from overlapping label sets. In the future we hope to explore other applications of metric learning to low-resource research, possibly in combination with explicit models for label entailment (tree learning, fuzzy sets), and/or Wasserstein distance.


  • M. Berger (2003) A panoramic view of riemannian geometry. Springer-Verlag Berlin Heidelberg, Heidelberg, Germany. Cited by: §2, Proposition 2.
  • T. Brants (2000) TnT: a statistical part-of-speech tagger. In

    Proceedings of the sixth conference on Applied natural language processing

    pp. 224–231. Cited by: §1.
  • L. Cai and T. Hofmann (2004)

    Hierarchical document categorization with support vector machines

    In CKIM, pp. 78–87. Cited by: §1, §2.
  • J. Cannon, W. Floyd, R. Kenyon, and W. Parry (1997) Hyperbolic geometry. Note: Cited by: Appendix A, §2, §3.3, §3.3.
  • B. Chen, X. Huang, L. Xiao, Z. Cai, and L. Jing (2019) Hyperbolic interaction model for hierarchical multi-label classification. Note: Cited by: §2.
  • X. Chen, Y. Sun, B. Athiwaratkun, C. Cardie, and K. Weinberger (2018) Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6 (). Cited by: §2.
  • M. Collins (2002)

    Discriminative training methods for hidden markov models: theory and experiments with perceptron algorithms

    In EMNLP, pp. 1–8. Cited by: §1.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In NeurIPS, pp. 3079–3087. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2.
  • O. Ganea, G. Bécigneul, and T. Hofmann (2018a) Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, Cited by: §2.
  • O. Ganea, G. Bécigneul, and T. Hofmann (2018b) Hyperbolic neural networks. In NeurIPS, Cited by: §2.
  • T. Gao, X. Han, Z. Liu, and M. Sun (2019) Hybrid attention-based prototypical networks for noisy few-shot relation classification. In AAAI, Cited by: §2.
  • A. Gu, F. Sala, B. Gunel, and C. Ré (2019) Learning mixed-curvature representations in products of model spaces. In ICLR, Cited by: §1, §2, §3.3.
  • J. Guo, D. Shah, and R. Barzilay (2018) Multi-source domain adaptation with mixture of experts. In EMNLP, Cited by: §2.
  • X. Han, H. Zhu, P. Yu, Z. Wang, Y. Yao, Z. Liu, and M. Sun (2018) FewRel: a large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In EMNLP, Cited by: §2.
  • V. Khrulkov, L. Mirvakhabova, E. Ustinova, I. Oseledets, and V. Lempitsky (2019) Hyperbolic image embeddings. Cited by: §2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In EMNLP, pp. 1746–1751. Cited by: §1.
  • K. Kowsari, D. E. Brown, M. Heidarysafa, K. J. Meimandi, M. S. Gerber, and L. E. Barnes (2017) HDLTex: hierarchical deep learning for text classification. In IEEE ICMLA, pp. 364–371. Cited by: §1, §2, 2nd item.
  • D. Krioukov, F. Papadopoulos, M. Kitsak, A. Vahdat, and M. Boguñá (2010) Hyperbolic geometry of complex networks. Phys. Rev. E 82. Cited by: §3.3.
  • T. Lei, Y. Zhang, S. I. Wang, H. Dai, and Y. Artzi (2018) Simple recurrent units for highly parallelizable recurrence. In EMNLP, Cited by: §4.
  • Z. Luo, Y. Zou, J. Hoffman, and L. Fei-Fei (2017) Label efficient learning of transferable representations across domains and tasks. In NeurIPS, Cited by: §2.
  • T. Miyato, A. M. Dai, and I. Goodfellow (2016) Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725. Cited by: §2.
  • A. Naik, A. Charuvaka, and H. Rangwala (2013) Classifying documents within multiple hierarchical datasets using multi-task learning. In IEEE International Conference on Tools with Artificial Intelligence. Cited by: §2.
  • M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. In NeurIPS, pp. 6338–6347. Cited by: §1, §2.
  • M. Nickel and D. Kiela (2018) Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML, pp. 3776–3785. Cited by: §2.
  • B. Pang and L. Lee (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 115–124. Cited by: §1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, pp. 1532–1543. Cited by: §4.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, Cited by: §2.
  • S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §2.
  • F. Sala, C. De Sa, A. Gu, and C. Ré (2018) Representation tradeoffs for hyperbolic embeddings. In ICML, pp. 4460–4469. Cited by: §1, §2, §3.3.
  • R. Sarkar (2011) Low distortion Delaunay embedding of trees in hyperbolic plane. In Proc. of the International Symposium on Graph Drawing (GD 2011), pp. 355–366. Cited by: §2.
  • K. Sinha, Y. Dong, J. C. K. Cheung, and D. Ruths (2018) A hierarchical neural attention-based text classifier. In EMNLP, pp. 817–823. Cited by: §2, 2nd item.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NeurIPS, pp. 4077–4087. Cited by: item 1, §2, §3.2, footnote 2.
  • W. Thurston (2002) The geometry and topology of three-manifolds: chapter 2, elliptic and hyperbolic geometry. Cited by: §2, §3.3.
  • K. Toutanova, D. Klein, C. D. Manning, and Y. Singer (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL, pp. 173–180. Cited by: §1.
  • I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun (2005) Large margin methods for structured and interdependent output variables. JMLR 6, pp. 1453–1484. Cited by: §1, §2.
  • E. Ustinova and V. Lempitsky (2016) Learning deep embeddings with histogram loss. In NeurIPS, pp. 4170–4178. Cited by: §2.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In NeurIPS, pp. 3630–3638. Cited by: §2.
  • K. Weinberger and L. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, pp. 207–244. Cited by: §2.
  • I. E. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. Dhillon (2016) PD-Sparse: a primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, pp. 3069–3077. Cited by: §2.
  • M. Yu, X. Guo, J. Yi, S. Chang, S. Potdar, Y. Cheng, G. Tesauro, H. Wang, and B. Zhou (2018) Diverse few-shot text classification with multiple metrics. In NAACL, Cited by: §2.
  • H. Zhang and S. Sra (2016) First-order methods for geodesically convex optimization. In COLT, Vol. 49, pp. 1617–1638. Cited by: Appendix A.

Appendix A Proof of Proposition 1


Every finite collection of points $X = \{x_1, \dots, x_n\}$ in hyperbolic space has a unique mean $\mu$. Furthermore, solving the optimization problem (5) with Riemannian gradient descent will converge to $\mu$.

The idea of the proof is to use a known result, which states that our optimization target is strictly convex under the assumption that $X$ is contained in a compact ball. In order to use this result, we show that any ball in hyperbolic space is geodesically convex.

Proposition 2 (Proposition 60 in Berger (2003)).

Given a manifold $M$, a compact set $K$ contained within a convex ball, and a mass distribution $\nu$ on $K$ with total mass one, the function $p \mapsto \int_K d(p, q)^2 \, d\nu(q)$ is strictly convex and achieves a unique minimizer.

To use this proposition for our purposes, let the distribution $\nu$ be the uniform distribution over the points $x_1, \dots, x_n$. We must show that $X$ is contained inside a convex ball. Let $p$ be any point in hyperbolic space, and let $r = \max_i d(p, x_i)$. Then $X$ is contained within a ball of radius $r$ about $p$, which we will call $B$.

We need to show that $B$ is geodesically convex. In the hyperboloid model, geodesics coincide with the intersection of the hyperboloid with planes through the origin (Cannon et al. (1997), page 80).

Given points $x, y \in B$, we need that the geodesic segment between $x$ and $y$ is contained in $B$. This geodesic segment is precisely the projection onto the hyperboloid, through the origin, of the line segment $t x + (1 - t) y$ for $t \in [0, 1]$, where the sum is taken in the ambient real vector space. This projection is given by $v \mapsto v / \sqrt{-\langle v, v \rangle}$, where $\langle \cdot, \cdot \rangle$ denotes the Minkowski bilinear form.

By the hyperbolic distance formula (3), the ball of radius $r$ centered on $p$ is precisely the set of points $z$ on the hyperboloid satisfying $-\langle p, z \rangle \le \cosh(r)$. Writing $v = t x + (1 - t) y$, we have

$$-\left\langle p, \frac{v}{\sqrt{-\langle v, v \rangle}} \right\rangle \le -\langle p, v \rangle = -t \langle p, x \rangle - (1 - t) \langle p, y \rangle \le t \cosh(r) + (1 - t) \cosh(r) = \cosh(r),$$

where we have used the fact that the normalizing constant $\sqrt{-\langle v, v \rangle}$ is greater than or equal to $1$, the linearity of $\langle p, \cdot \rangle$, and the fact that the endpoints $x$ and $y$ lie in $B$. Hence the geodesic segment is completely contained in $B$, and the proposition gives us strict convexity of the objective, as well as existence and uniqueness of $\mu$. Strict convexity of a smooth function on a compact ball gives strong convexity on a slightly smaller ball, smoothness on a compact ball gives Lipschitz continuity of the gradient, and hyperbolic space has constant curvature $-1$, making it a Hadamard manifold; hence Riemannian gradient descent will converge to the minimum, assuming a sufficiently small step size (Zhang and Sra (2016), Theorem 13).
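The gradient descent procedure referenced above can be exercised numerically. The following is a minimal NumPy sketch of Riemannian gradient descent for the mean in the hyperboloid model, assuming the Minkowski inner product $\langle u, v \rangle = -u_0 v_0 + \sum_{i \ge 1} u_i v_i$ and the distance $d(x, y) = \operatorname{arccosh}(-\langle x, y \rangle)$ from (3); the function names and hyperparameters are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def minkowski_dot(u, v):
    # Minkowski bilinear form: -u0*v0 + u1*v1 + ... + un*vn
    return -u[..., 0] * v[..., 0] + np.sum(u[..., 1:] * v[..., 1:], axis=-1)

def exp_map(mu, v):
    # exponential map at mu applied to a tangent vector v
    norm = np.sqrt(np.clip(minkowski_dot(v, v), 1e-12, None))
    return np.cosh(norm) * mu + np.sinh(norm) * v / norm

def hyperbolic_mean(points, lr=0.2, steps=200):
    """Riemannian gradient descent on f(mu) = (1/n) * sum_i d(mu, x_i)^2."""
    mu = points[0].copy()
    n = len(points)
    for _ in range(steps):
        u = np.clip(-minkowski_dot(mu, points), 1.0 + 1e-9, None)
        d = np.arccosh(u)
        # ambient Minkowski gradient of f at mu
        coeff = -2.0 * d / (n * np.sqrt(u ** 2 - 1.0))
        h = (coeff[:, None] * points).sum(axis=0)
        # project onto the tangent space of the hyperboloid at mu
        v = h + minkowski_dot(mu, h) * mu
        mu = exp_map(mu, -lr * v)
        # renormalize to counteract floating-point drift off the manifold
        mu = mu / np.sqrt(-minkowski_dot(mu, mu))
    return mu
```

By symmetry, the mean of two points mirrored across the apex $(1, 0, 0)$ of the hyperboloid is the apex itself, which makes the routine easy to sanity-check.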

Appendix B Two Differing Means

Here we show that the mean given by (5) differs from that given by (6). Let $x_1$ and $x_2$ be two distinct points in hyperbolic space, and consider the multiset $X = \{x_1, x_1, x_2\}$, in which $x_1$ appears with multiplicity two. Both means lie on the geodesic between $x_1$ and $x_2$; write $d = d(x_1, x_2)$ for the hyperbolic distance between them, computed via (3).

The minimizer of (5) is the point on this geodesic at distance $d/3$ from $x_1$: parametrizing the geodesic by arclength $t$ measured from $x_1$, the objective is $2t^2 + (d - t)^2$, which is minimized at $t = d/3$. Computing the mean via (6) instead, and measuring its distance from $x_1$ using (3), yields a different value. Hence the two means are not equivalent.
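A concrete instance is easy to compute. The sketch below assumes that (5) is the Fréchet mean (the minimizer of the summed squared hyperbolic distances, which for points on a single geodesic is the arclength average) and that (6) averages the points in the ambient Minkowski space and projects back onto the hyperboloid; both assumptions, and the helper names, are illustrative rather than taken from the paper. It places the doubled point and the single point on a common geodesic of $\mathbb{H}^1 \subset \mathbb{R}^2$ and compares the two resulting distances from the doubled point.

```python
import numpy as np

def mdot(u, v):
    # Minkowski bilinear form on R^2
    return -u[0] * v[0] + u[1] * v[1]

def dist(u, v):
    # hyperbolic distance, as in (3)
    return np.arccosh(-mdot(u, v))

def point(t):
    # arclength parametrization of a geodesic through the apex of H^1
    return np.array([np.cosh(t), np.sinh(t)])

d = 1.0
x1, x2 = point(0.0), point(d)
X = [x1, x1, x2]  # x1 counted with multiplicity two

# Frechet mean: for points on one geodesic, average the arclength coordinates
frechet = point((0.0 + 0.0 + d) / 3.0)

# projected Minkowski average (assumed form of (6))
m = np.mean(X, axis=0)
projected = m / np.sqrt(-mdot(m, m))

print(dist(x1, frechet), dist(x1, projected))  # the two distances differ
```

The first distance is exactly $d/3$, while the projected average lands at a slightly different point on the same geodesic, so the gap between the two means is small but nonzero.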

Note that the point set used in this example is quite skewed, with twice as many points on one side as on the other, yet the difference between the two means remains small. We conjecture that the difference can be bounded tightly by a function of the point set and the diameter of its convex hull; proving this claim is an interesting open question for future work.

Appendix C 20 Newsgroups Hierarchy

  ROOT
    COMPUTER
      GRAPHICS (leaf)
      WINDOWSX (leaf)
      WINDOWSOS (leaf)
      PC (leaf)
      MAC (leaf)
    RECREATION
      AUTOS (leaf)
      MOTORCYCLES (leaf)
      BASEBALL (leaf)
      HOCKEY (leaf)
    SCIENCE
      CRYPT (leaf)
      ELECTRONICS (leaf)
      MEDICINE (leaf)
      SPACE (leaf)
    POLITICS
      GUNS (leaf)
      MIDEAST (leaf)
    RELIGION
      CHRISTIAN (leaf)
      ATHEISM (leaf)
    FORSALE (leaf)

Appendix D Dataset Details

Here we present additional details for the datasets used in our experiments.

Name                               # Train   # Dev   # Test   # Labels   Depth
Newsgroups (NEWS)
Web of Science (WOS)
Twitter Airline Sentiment (SENT)                                         Flat

Table 3: Datasets for Sections 4 and 5. Half of the training documents are used for pretraining on the initial label set.

Appendix E Additional Experiments

Here we present results from the analysis experiments of Sections 4 and 5 on additional datasets.

Figure 4: Accuracy on the SENT dataset against the number of fine-tuning examples: (a) all classes and (b) newly introduced classes only. The mean is taken over 5 random label splits, with error bars at one standard deviation.
Figure 5: Accuracy on the WOS dataset against the number of fine-tuning examples: (a) all classes and (b) newly introduced classes only. The mean is taken over 5 random label splits, with error bars at one standard deviation.