These techniques require a huge amount of labeled training data to achieve optimal performance. This is a severe bottleneck, as obtaining large amounts of hand-labeled data is an expensive process. Zero-shot learning is a training strategy which allows a machine learning model to predict novel classes without the need for any labeled examples for the new classes Romera-Paredes and Torr (2015); Socher et al. (2013); Wang et al. (2019). Zero-shot learning models learn parameters for seen classes along with their class representations. During inference, new class representations are provided for the unseen classes. Previous zero-shot learning systems have used attributes Akata et al. (2015); Farhadi et al. (2009); Lampert et al. (2014), pretrained embeddings Frome et al. (2013) and learnable embeddings (e.g. sentence embeddings) Xian et al. (2016) as class representations.
We need to consider several factors while designing a zero-shot learning framework: (1) it should adapt to unseen classes without requiring additional human effort, (2) it should provide rich features such that the unseen classes have sufficient distinguishing characteristics among themselves, (3) it should be applicable to a range of downstream tasks.
Previous approaches for class representations have various limitations. On one end of the spectrum, attribute-based methods provide rich features, but curating attributes for each class is a cumbersome process and the attributes have to be decided ahead of time for the unseen classes. On the other end of the spectrum, pretrained embeddings such as GloVe Pennington et al. (2014) and Word2Vec Mikolov et al. (2013) offer the flexibility of easily adapting to new classes but rely on unsupervised training on large corpora, which may not provide the distinguishing characteristics necessary for zero-shot learning. Recent approaches using graph neural networks for zero-shot object classification have achieved state-of-the-art performance Kampffmeyer et al. (2019); Wang et al. (2018). GCNZ Wang et al. (2018) and DGP Kampffmeyer et al. (2019) train a graph neural network on the ImageNet graph to generate class representations. However, in Section 4 we show that these methods often do not perform well beyond object classification.
In our work, we propose to learn class representations from common sense knowledge graphs. Common sense knowledge graphs Liu and Singh (2004); Speer et al. (2017); Tandon et al. (2017); Zhang et al. (2020) represent high-level knowledge implicit to humans as concept nodes. These graphs have explicit edges between related concept nodes and provide valuable information to distinguish between different concepts. However, adapting existing zero-shot learning frameworks to learn class representations from common sense knowledge graphs can be problematic in several ways. GCNZ Wang et al. (2018) learns graph neural networks with a symmetrically normalized graph Laplacian, which not only requires the entire graph structure during training but also needs retraining if the graph structure changes. DGP Kampffmeyer et al. (2019) aims to generate expressive class representations and uses an asymmetrically normalized graph Laplacian but assumes a directed acyclic graph such as WordNet.
To address these limitations, we propose ZSL-KG, a new framework based on graph neural networks. Graph neural networks learn to represent the structure of graphs by aggregating information from each node’s neighbourhood. Aggregation techniques used in GCNZ, DGP, and most other graph neural network approaches are linear, in the sense that they take a (possibly weighted) mean or maximum of the neighbourhood features. On the other hand, non-linear aggregators such as LSTMs Hochreiter and Schmidhuber (1997) can potentially generate more expressive features. Some recent works Hamilton et al. (2017a); Murphy et al. (2019) that have considered LSTM aggregators achieved competitive performance in comparison to linear aggregators for relatively low-dimensional tasks such as node classification. We find that non-linear graph aggregators based on LSTMs and transformers are particularly beneficial for zero-shot learning. However, an LSTM is not uniformly the best performer. To address this challenge and make ZSL-KG more flexible, we introduce a transformer Vaswani et al. (2017) aggregator for graph neural networks. Additionally, our framework is inductive, i.e., the graph neural network can be executed on graphs that are different from the training graph, which is necessary for inductive zero-shot learning under which the test classes are unknown during training.
We demonstrate the effectiveness of our framework on three zero-shot learning tasks in language and vision: intent classification, fine-grained entity typing, and object classification. On the language tasks, we observe an average improvement of 9.2 points of accuracy on three datasets over the existing state-of-the-art benchmarks. On object classification, we find a 2.2 accuracy point improvement versus the best methods that do not require hand-engineered class representations Kampffmeyer et al. (2019); Wang et al. (2018). Finally, we perform a study on the choice of neighbourhood aggregator in our ZSL-KG framework. We find that non-linear aggregators perform significantly better than linear aggregators on zero-shot learning by an average 3.8 accuracy points. In summary, our main contributions are the following:
- We propose learning zero-shot class representations from common sense knowledge graphs.
- We present ZSL-KG, a framework for zero-shot learning based on graph neural networks with non-linear aggregators. To increase the flexibility of ZSL-KG, we introduce a novel transformer-based aggregator for graph neural networks.
- ZSL-KG achieves new state-of-the-art scores on language tasks for the SNIPS-NLU Coucke et al. (2018), FIGER Ling and Weld (2012), and Ontonotes Gillick et al. (2014) datasets. It also achieves a new state-of-the-art score among methods that do not use hand-engineered class representations on object classification for Animals with Attributes 2 Xian et al. (2018), and is 1 point away from the overall state of the art.
In this section, we summarize zero-shot learning and graph neural networks.
2.1 Zero-Shot Learning
Zero-shot learning has several variations Wang et al. (2019); Xian et al. (2018). Our work focuses on inductive zero-shot learning, under which we do not have access to the unseen classes during training. Given a training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $y_i \in Y_S$, the set of seen classes, we aim to learn a classifier. Unlike traditional classification tasks, zero-shot learning systems are trained along with class representations such as attributes, pretrained embeddings, etc.
Recent approaches learn a class encoder $\phi(y) \in \mathbb{R}^d$ to produce vector-valued class representations from an initial input $y$, such as a string or other identifier of the class. (In our case, $y$ is a node in a graph and its $k$-hop neighborhood.) During inference, the class representations are used to label examples with the unseen classes by passing the examples through an example encoder $\theta(x) \in \mathbb{R}^d$ and predicting the class whose representation has the highest inner product with the example representation.
Recent work in zero-shot learning commonly uses one of two approaches to learn the class encoder $\phi$. One approach uses a bilinear similarity function defined by a compatibility matrix $W \in \mathbb{R}^{d \times d}$ Frome et al. (2013); Xian et al. (2018):
$$f(\theta(x), W, \phi(y)) = \theta(x)^\top W \phi(y)$$
The bilinear similarity function gives a score for each example-class pair. The parameters of $\theta$, $W$, and $\phi$ are learned by taking a softmax over $f(\cdot)$ for all possible seen classes $y \in Y_S$ and minimizing either the cross entropy loss or a ranking loss with respect to the true labels. In other words, $f(\cdot)$ should give a higher score for the correct class(es) and lower scores for the incorrect classes. $W$ is often constrained to be low rank, to reduce the number of learnable parameters Obeidat et al. (2019); Yogatama et al. (2015). Lastly, other variants of the similarity function add minor variations such as non-linearities between factors of $W$ Socher et al. (2013); Xian et al. (2016).
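As an illustration of the bilinear scoring and softmax step described above, here is a minimal pure-Python sketch. The function names (`bilinear_score`, `softmax`, `predict`), the toy class names, and all numeric values are hypothetical and chosen only for illustration; a real system would use learned encoders and a tensor library.

```python
import math

def bilinear_score(example_vec, W, class_vec):
    """Score one example-class pair as example_vec^T W class_vec."""
    # First compute W @ class_vec, then take the inner product with example_vec.
    projected = [sum(w * c for w, c in zip(row, class_vec)) for row in W]
    return sum(e * p for e, p in zip(example_vec, projected))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(example_vec, W, class_vecs):
    """Return the highest-scoring class name and the softmax probabilities."""
    names = list(class_vecs)
    scores = [bilinear_score(example_vec, W, class_vecs[n]) for n in names]
    probs = softmax(scores)
    best = max(range(len(names)), key=lambda i: scores[i])
    return names[best], dict(zip(names, probs))

# Toy, hand-picked values (hypothetical, not taken from the paper).
W = [[1.0, 0.0], [0.0, 1.0]]  # identity compatibility matrix
class_vecs = {"play_music": [1.0, 0.0], "book_flight": [0.0, 1.0]}
label, probs = predict([0.9, 0.1], W, class_vecs)
```

During training, the negative log of the probability assigned to the true class would serve as the cross entropy loss; at test time only the arg-max over the unseen classes is needed.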
The other common approach is to first train a neural network classifier in a supervised fashion. The final fully connected layer of this network has a vector representation for each seen class, and the remaining layers are used as the example encoder $\theta$. Then, the class encoder $\phi$ is trained by minimizing the L2 loss between the representations from supervised learning and the class representations $\phi(y)$ Kampffmeyer et al. (2019); Socher et al. (2013); Wang et al. (2018).
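The regression objective in this second approach can be sketched as follows. This is only an illustration: the function name `l2_loss`, the toy class names, and the choice to average over classes (rather than sum) are assumptions, not details given in the text.

```python
def l2_loss(classifier_weights, class_encoder_outputs):
    """Mean squared L2 distance between per-class targets and encoder outputs."""
    total = 0.0
    for name, target in classifier_weights.items():
        predicted = class_encoder_outputs[name]
        total += sum((t - p) ** 2 for t, p in zip(target, predicted))
    # Averaging over classes is an assumption made for this sketch.
    return total / len(classifier_weights)

# Hypothetical final-layer vectors of a trained classifier (the regression targets)
# and hypothetical outputs of a class encoder for the same seen classes.
targets = {"zebra": [1.0, 0.0], "horse": [0.0, 2.0]}
outputs = {"zebra": [1.0, 1.0], "horse": [0.0, 0.0]}
loss = l2_loss(targets, outputs)
```

Once the class encoder is trained this way, its outputs for unseen classes can stand in for the missing rows of the final classification layer.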
The class encoder that we propose in Section 3 can be plugged into either approach.
2.2 Graph Neural Networks
The basic idea behind graph neural networks is to learn node embeddings that reflect the structure of the graph Hamilton et al. (2017b). Consider the graph $G = (V, E, R)$, where $V$ is the set of vertices with node features $x_v$, $(v_i, r, v_j) \in E$ are the labeled edges, and $r \in R$ are the relation types. Graph neural networks learn node embeddings by iterative aggregation of the $k$-hop neighbourhood. Each layer of a graph neural network has two main components, AGGREGATE and COMBINE Xu et al. (2019):
$$a_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\left(\left\{ h_u^{(l-1)} \;\forall u \in \mathcal{N}(v) \right\}\right)$$
where $a_v^{(l)}$ is the aggregated node feature of the neighbourhood and $h_u^{(l-1)}$ is the node feature in the neighbourhood $\mathcal{N}(\cdot)$ of node $v$. The aggregated node feature is passed to COMBINE to generate the node representation $h_v^{(l)}$ for the $l$-th layer:
$$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\left(h_v^{(l-1)}, a_v^{(l)}\right)$$
where $h_v^{(0)} = x_v$ is the initial feature vector for the node. Previous works on graph neural networks for zero-shot learning have used GloVe Pennington et al. (2014) to represent the initial features Kampffmeyer et al. (2019); Wang et al. (2018).
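To make the layer-wise aggregate-then-combine recursion concrete, the following sketch runs it on a three-node toy graph. The mean aggregator and the averaging combine function are deliberately simple stand-ins chosen for this example (they are linear and have no learnable weights), and the graph and feature values are hypothetical.

```python
def mean_aggregate(neighbour_feats):
    """AGGREGATE: element-wise mean of the neighbours' feature vectors."""
    n = len(neighbour_feats)
    dim = len(neighbour_feats[0])
    return [sum(f[i] for f in neighbour_feats) / n for i in range(dim)]

def combine(self_feat, agg_feat):
    """COMBINE: here simply the average of the node's own feature and the aggregate."""
    return [(s + a) / 2.0 for s, a in zip(self_feat, agg_feat)]

def gnn_forward(graph, features, num_layers):
    """Apply num_layers rounds of AGGREGATE/COMBINE to every node."""
    h = dict(features)  # layer-0 representations are the initial features
    for _ in range(num_layers):
        new_h = {}
        for v, neighbours in graph.items():
            # Assumes every node has at least one neighbour.
            a_v = mean_aggregate([h[u] for u in neighbours])
            new_h[v] = combine(h[v], a_v)
        h = new_h
    return h

graph = {
    "mayor": ["politician"],
    "politician": ["mayor", "statesman"],
    "statesman": ["politician"],
}
features = {"mayor": [1.0, 0.0], "politician": [0.0, 0.0], "statesman": [0.0, 1.0]}
h1 = gnn_forward(graph, features, num_layers=1)
h2 = gnn_forward(graph, features, num_layers=2)
```

After two layers the representation of `mayor` has a non-zero second component, which can only have come from `statesman`, two hops away. This is the sense in which stacking layers widens the neighbourhood a node's embedding reflects.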
3 Zero-shot Learning with Common Sense Knowledge Graphs
Figure 1: The aggregate and combine functions used in our graph neural network. The neighbourhood nodes are passed through the non-linear aggregate function to generate the aggregated node feature. The aggregated node feature is passed to the combine function to generate the node embedding. Optionally, the node’s feature from the previous layer can also be passed into the combine function, which serves as a form of residual connection. These operations are computed recursively for k hops to generate the class representation.
Here we introduce ZSL-KG: a framework based on graph neural networks with common sense knowledge graphs for zero-shot learning. We first highlight the issues while modeling common sense knowledge graphs. Then, we address these challenges with inductive graph neural networks with non-linear aggregators.
Common sense knowledge graphs organize high-level knowledge implicit to humans in a graph. The nodes in the graph can be abstract concepts as well as definitional concepts. For example, the concept politician is associated with definitional concepts such as mayor and statesman as well as common sense concepts running_town, town_hall, etc. The nodes are linked with edges associated with relation types. For instance, (mayor, IsA, politician) has the relation IsA linking mayor to politician. These associations in the graph offer a rich source of information, which makes common sense knowledge graphs applicable to a wide range of tasks. However, because they are rich, they are also large scale. Publicly available common sense knowledge graphs range roughly from 100,000 to 8 million nodes and 2 million to 21 million edges Speer et al. (2017); Zhang et al. (2020).
To learn class representations from common sense knowledge graphs, we look to graph neural networks. Existing methods from zero-shot object classification such as GCNZ and DGP cannot be adapted to common sense knowledge graphs as they do not scale to large graphs or require a directed acyclic graph. Other general-purpose graph neural networks that use linear aggregators to learn the structure of the graph might be inadequate to capture the complex relationships in the knowledge graph. Recent work on LSTM aggregators has shown strong performance on low-dimensional tasks such as node classification, but has not been explored for complex zero-shot learning tasks Murphy et al. (2019).
We propose to learn class representations with non-linear graph aggregators: either LSTMs or a novel approach using transformers. Figure 1 shows the high-level architecture of our aggregator.
LSTM Aggregator. LSTMs Hochreiter and Schmidhuber (1997) are used for learning sequences, often in language tasks, where they take a sequence of words or features as input to make predictions. This means that LSTMs are not permutation invariant, i.e., the order of the node features can affect the output. To overcome this limitation, we randomly permute the neighbourhood nodes while generating the aggregated vector $a_v^{(l)}$. The nodes in the neighbourhood are passed through the LSTM to obtain the aggregated vector $a_v^{(l)}$. Then, we concatenate the aggregated vector with the node’s feature from the previous layer and pass the result through a linear layer followed by a non-linearity $\sigma$:
$$a_v^{(l)} = \mathrm{LSTM}^{(l)}\left(h_u^{(l-1)} \;\forall u \in \mathcal{N}(v)\right)$$
$$h_v^{(l)} = \sigma\left(W^{(l)} \cdot \left[h_v^{(l-1)} \,\|\, a_v^{(l)}\right]\right)$$
where $h_u^{(l-1)}$ is the previous-layer feature of a node $u$ in the neighbourhood $\mathcal{N}(v)$, $\mathrm{LSTM}^{(l)}$ is the learnable LSTM network, and $W^{(l)}$ is a learnable projection weight matrix for the $l$-th layer of the graph neural network.
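The LSTM aggregation step can be sketched in pure Python as below. This is a toy illustration under several assumptions: the `TinyLSTM` class is a bias-free cell with random, untrained weights; ReLU is used as the non-linearity; and all dimensions and feature values are hypothetical. A practical implementation would use a deep learning library.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class TinyLSTM:
    """Minimal bias-free LSTM cell with random (untrained) weights."""

    def __init__(self, in_dim, hid_dim, rng):
        def mat():
            return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim + hid_dim)]
                    for _ in range(hid_dim)]
        # One weight matrix per gate: input, forget, output, candidate.
        self.Wi, self.Wf, self.Wo, self.Wg = mat(), mat(), mat(), mat()
        self.hid_dim = hid_dim

    def run(self, sequence):
        """Consume the sequence and return the final hidden state."""
        h = [0.0] * self.hid_dim
        c = [0.0] * self.hid_dim
        for x in sequence:
            z = list(x) + h
            i = [sigmoid(v) for v in matvec(self.Wi, z)]
            f = [sigmoid(v) for v in matvec(self.Wf, z)]
            o = [sigmoid(v) for v in matvec(self.Wo, z)]
            g = [math.tanh(v) for v in matvec(self.Wg, z)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h

def lstm_aggregate(lstm, neighbour_feats, rng):
    """Randomly permute the neighbours, then summarise them with the LSTM."""
    shuffled = list(neighbour_feats)
    rng.shuffle(shuffled)
    return lstm.run(shuffled)

def combine(self_feat, agg_feat, W):
    """Project the concatenation [self || aggregate] and apply ReLU (an assumption)."""
    return [max(0.0, v) for v in matvec(W, list(self_feat) + list(agg_feat))]

rng = random.Random(0)
lstm = TinyLSTM(in_dim=2, hid_dim=3, rng=rng)
a_v = lstm_aggregate(lstm, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], rng)
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 + 3)] for _ in range(4)]
h_v = combine([0.2, 0.8], a_v, W)
```

Because the recurrence processes neighbours one at a time, a different shuffle generally yields a different aggregate, which is why the permutation is re-sampled rather than fixed.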
Transformer Aggregator. Transformers Vaswani et al. (2017) take the entire input sequence to generate hidden states for each input feature. We take advantage of this property to achieve non-linear aggregation of the neighbourhood features. To make transformers permutation invariant, we simply do not add the sinusoidal positional embedding to the input. The nodes in the neighbourhood are passed through the transformer, the non-linear aggregator, to obtain the aggregated vector $a_v^{(l)}$. The aggregated vector is passed through a linear layer followed by a non-linearity $\sigma$:
$$a_v^{(l)} = \mu\left(T^{(l)}\left(h_u^{(l-1)} \;\forall u \in \mathcal{N}(v)\right)\right)$$
$$h_v^{(l)} = \sigma\left(W^{(l)} \cdot a_v^{(l)}\right)$$
where $h_u^{(l-1)}$ is the previous-layer feature of a node $u$ in the neighbourhood $\mathcal{N}(v)$, $T^{(l)}$ is the learnable transformer, $\mu$ is a pooling function such as mean-pooling which combines the node features to get the aggregated vector $a_v^{(l)}$, and $W^{(l)}$ is a learnable projection weight matrix.
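The permutation-invariance argument can be checked with a small sketch. The code below is not a full transformer: it is a single self-attention layer with identity query, key, and value projections (a simplifying assumption), followed by mean pooling. It only illustrates that, without positional embeddings, reordering the neighbours leaves the pooled output unchanged up to floating-point error.

```python
import math

def self_attention(feats):
    """Single-head scaled dot-product self-attention with identity Q/K/V projections."""
    dim = len(feats[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in feats]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * v[i] for w, v in zip(weights, feats))
                        for i in range(dim)])
    return outputs

def mean_pool(feats):
    """Pooling function: element-wise mean over the attended node features."""
    n = len(feats)
    return [sum(f[i] for f in feats) / n for i in range(len(feats[0]))]

def transformer_aggregate(neighbour_feats):
    """No positional embedding is added, so the pooled result ignores input order."""
    return mean_pool(self_attention(neighbour_feats))

# Hypothetical neighbour features.
neigh = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
a1 = transformer_aggregate(neigh)
a2 = transformer_aggregate(list(reversed(neigh)))
```

Self-attention without positions is permutation equivariant (reordering inputs reorders outputs the same way), and mean pooling discards the order, so the composition is invariant.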
Variants of graph neural networks that use relational information to learn the graph structure, such as RGCN, use linear aggregators Marcheggiani and Titov (2017); Schlichtkrull et al. (2018). In Section 4.4, we show that the LSTM and transformer aggregators significantly outperform other existing relational graph neural networks.
In our experiments, we use ConceptNet Speer et al. (2017) as our common sense knowledge graph but our approach can be adapted to other knowledge graphs. ConceptNet is a large scale knowledge graph with millions of nodes and edges, which poses a challenge to train the graph neural network. To solve this problem, we explored numerous neighbourhood sampling strategies. Existing work on neighbourhood sampling includes random sampling Hamilton et al. (2017a), importance sampling Chen et al. (2018b), random walks Ying et al. (2018), etc. Similar to PinSage Ying et al. (2018), we simulate random walks for the nodes in the graph and assign hitting probabilities to the neighbourhood nodes. During training and testing of the graph neural network, we select the top $N$ nodes from the neighbourhood based on their hitting probability.
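A random-walk-based neighbour selection of this kind could look roughly like the following sketch. The walk count, walk length, tie-breaking rule, and the use of normalised visit counts as the hitting probability are all assumptions made for illustration; the toy graph is hypothetical.

```python
import random
from collections import Counter

def hitting_probabilities(graph, start, num_walks, walk_length, rng):
    """Estimate how often short random walks from `start` visit each other node."""
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            neighbours = graph[node]
            if not neighbours:
                break
            node = rng.choice(neighbours)
            if node != start:
                visits[node] += 1
    total = sum(visits.values())
    # Normalised visit counts serve as the estimated hitting probabilities.
    return {n: c / total for n, c in visits.items()} if total else {}

def top_n_neighbours(graph, start, n, rng, num_walks=200, walk_length=2):
    """Keep the n nodes with the highest estimated hitting probability."""
    probs = hitting_probabilities(graph, start, num_walks, walk_length, rng)
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(probs, key=lambda node: (-probs[node], node))
    return ranked[:n]

graph = {
    "politician": ["mayor", "statesman", "town_hall"],
    "mayor": ["politician", "town_hall"],
    "statesman": ["politician"],
    "town_hall": ["politician", "mayor"],
}
selected = top_n_neighbours(graph, "politician", n=2, rng=random.Random(0))
```

Capping the neighbourhood at a fixed size keeps the cost of each aggregation step bounded regardless of how many edges a concept has in the full graph.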
4 Tasks and Results
We evaluate our framework on three zero-shot learning tasks: intent classification, fine-grained entity typing, and object classification. First, we describe the general setup and preprocessing of ConceptNet. Next, we detail how we adapt zero-shot object classification methods to our tasks. Then, we introduce the individual tasks and datasets and report results on each of the tasks. Finally, we perform an ablation on the choice of neighbourhood aggregators in our framework. The code and hyperparameters are included in the supplementary material, which will be released upon acceptance.
ConceptNet Setup. In all our experiments, we map each class to a node in ConceptNet 5.7 Speer et al. (2017) and query its 2-hop neighbourhood. For example, a class politician gets mapped to /c/en/politician in the graph and its 2-hop neighbourhood is queried. Then, we remove all the non-English concepts and their edges from the graph and make all the relations bidirectional. For fine-grained entity typing and object classification, we also take the union of the concepts’ neighbourhood that share the same prefix. For example, we take the union of the /c/en/politician and /c/en/politician/n. Then, we compute the embeddings for the concept using the pretrained GloVe 840B Pennington et al. (2014). We average the individual words in the concept to get the embedding. These embeddings serve as initial features for the graph neural network. We ran the random walks on train and test classes separately so no information about the identity of the test classes leaked.
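The word-averaging step for initial node features can be sketched as follows. The helper name `concept_embedding`, the three-dimensional toy vectors standing in for GloVe 840B, and the handling of out-of-vocabulary concepts are assumptions; the text does not specify how concepts with no known words are treated.

```python
def concept_embedding(concept_uri, word_vectors):
    """Average the word vectors of the words in a URI such as /c/en/town_hall."""
    # URI layout assumed: /c/<language>/<term>[/<part of speech>...]
    term = concept_uri.split("/")[3]
    words = term.split("_")
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return None  # out-of-vocabulary handling is not specified in the text
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

# Tiny stand-in vectors (hypothetical values, not real GloVe embeddings).
toy_vectors = {"town": [1.0, 0.0, 2.0], "hall": [3.0, 2.0, 0.0]}
emb = concept_embedding("/c/en/town_hall", toy_vectors)
```

With this layout, `/c/en/politician` and `/c/en/politician/n` resolve to the same term, which is consistent with taking the union of neighbourhoods for concepts that share a prefix.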
Experimental Setup. We evaluate ZSL-KG with an LSTM aggregator (ZSL-KG-LSTM) and a Transformer aggregator (ZSL-KG-Tr). We use two-layer graph neural networks, corresponding to two hops around each class node. During training and testing, we pick the 50 and 100 node neighbours with the highest hitting probabilities for the first and the second hop, respectively.
We adapt GCNZ, SGCN, and DGP for zero-shot learning tasks in language. GCNZ Wang et al. (2018) uses a symmetrically normalized graph Laplacian to generate the class representations. SGCN Kampffmeyer et al. (2019) uses an asymmetrically normalized graph Laplacian to learn the class representations. Finally, DGP Kampffmeyer et al. (2019) exploits the hierarchical graph structure and avoids dilution of knowledge from intermediate nodes. It uses a dense graph connectivity scheme with a two-stage propagation from ancestors and descendants to learn the class representations. We mapped the classes to the nodes in the WordNet graph for each dataset. On the language tasks, all three methods use two-layer graph neural networks. On the computer vision task, we use the code obtained from Kampffmeyer et al. (2019) to replicate the methods.
4.1 Intent Classification
Intent Classification is the task of identifying users’ intent expressed in chatbots and personal voice assistants. Developing zero-shot learning systems for intent classification can allow existing models to adapt to emerging intents in personal assistants.
We evaluate on the main open-source benchmark for intent classification: SNIPS-NLU Coucke et al. (2018). The dataset was collected using crowdsourcing to benchmark the performance of voice assistants.
Intent classification is a zero-shot multi-class classification task. Since the number of classes is small, we use the bilinear similarity architecture for zero-shot learning. The example encoder is a BiLSTM with attention. The words in the example are represented with GloVe 840B. The training set has 5 classes which we split into 3 train classes and 2 development classes. We train for 10 epochs by minimizing the cross entropy loss and pick the model with the least loss on the development set. We measure prediction accuracy.
We compare ZSL-KG against existing state-of-the-art approaches in the literature for intent classification: DeViSE Frome et al. (2013), IntentCapsNet Xia et al. (2018), and ReCapsNet-ZS Liu et al. (2019). DeViSE uses pretrained embeddings as class representations. IntentCapsNet and ReCapsNet-ZS are CapsuleNet Sabour et al. (2017) based approaches and have reported the best performance on the task.
|Method||Accuracy|
|DeViSE Frome et al. (2013)||74.47|
|IntentCapsNet Xia et al. (2018)||77.52|
|ReCapsNet-ZS Liu et al. (2019)||79.96|
|GCNZ Wang et al. (2018)||82.47 ± 3.09|
|SGCN Kampffmeyer et al. (2019)||50.27 ± 14.13|
|DGP Kampffmeyer et al. (2019)||64.41 ± 12.87|
Table 1: Results for intent classification on the SNIPS-NLU dataset. We report the average performance of the models on 5 random seeds and the standard error. The results for DeViSE, IntentCapsNet and ReCapsNet-ZS are obtained from Liu et al. (2019).
Results. Table 1 shows the results. Both variants of our framework significantly outperform the existing approaches and improve the state-of-the-art accuracy to 88.98%. The results for the baselines from zero-shot object classification indicate that GCNZ performs slightly better than existing benchmarks. Finally, SGCN and DGP perform poorly and show high variance on the task.
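For reference, a mean and standard error over several seeds can be computed as in the sketch below. The accuracy values are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation (dividing by n - 1) is an assumption about how the standard error was computed.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 in the denominator) is assumed here.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical accuracies from five random seeds.
accuracies = [80.0, 82.0, 84.0, 86.0, 88.0]
mean, sem = mean_and_standard_error(accuracies)
```

With only five seeds the standard error is itself a noisy estimate, so a large value is best read as a sign of instability rather than as a precise interval.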
4.2 Fine-Grained Entity Typing
Fine-grained entity typing is the task of classifying named entities into one or more narrowly scoped semantic types. Identifying fine-grained types of named entities has been shown to improve downstream performance in relation extraction Yaghoobzadeh and Schütze (2015), question answering Yavuz et al. (2016) and coreference resolution Durrett and Klein (2014).
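Datasets. We evaluate on two fine-grained entity typing datasets: FIGER Ling and Weld (2012) and Ontonotes Gillick et al. (2014). The datasets are collected using a combination of distant supervision from knowledge bases and heuristics to label the entities.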
Both datasets are traditionally used in a supervised setting. OTyper Yuan and Downey (2018) created a zero-shot learning dataset for FIGER, where they divided their classes into 10 folds of train, development and test classes. Similar to FIGER, we convert Ontonotes into a zero-shot learning dataset and split the classes into multiple folds. See Section B.2 for more details.
Experiment. We reconstructed OTyper Yuan and Downey (2018) and DZET Obeidat et al. (2019), the state-of-the-art methods for this task. Both methods use the AttentiveNER biLSTM Shimaoka et al. (2017) as the example encoder. See Section B.1 for more details. OTyper averages the GloVe embeddings for the words in the name of each class to represent it. For DZET, we manually mapped the classes to Wikipedia articles. We pass each article’s first paragraph through a learnable biLSTM to obtain the class representations.
We train each model for 5 epochs by minimizing the cross-entropy loss and pick the model with the least loss on the development set. We evaluate the performance on the unseen classes by computing the micro average strict accuracy and macro average strict accuracy across all the folds. See Section B.3 for more details.
Results. Table 2 shows results of experiments on fine-grained entity typing. Our results show that ZSL-KG-LSTM significantly outperforms the state-of-the-art methods. ZSL-KG-Tr does not perform as well as ZSL-KG-LSTM on the task, but is also competitive with the state of the art. We also observe that methods from zero-shot object classification do not perform as well on fine-grained entity typing compared to other existing specialized methods from the language community.
|Method||Mic. Avg.||Mac. Avg.||Mic. Avg.||Mac. Avg.|
|OTyper Yuan and Downey (2018)||54.39 ± 1.97||58.08 ± 0.75||33.38 ± 0.02||31.67 ± 1.44|
|DZET Obeidat et al. (2019)||50.12 ± 1.48||50.15 ± 1.49||37.37 ± 3.95||34.49 ± 3.61|
|GCNZ Wang et al. (2018)||37.43 ± 1.78||48.41 ± 1.13||38.07 ± 1.71||40.22 ± 1.11|
|SGCN Kampffmeyer et al. (2019)||41.58 ± 1.02||53.92 ± 1.78||22.15 ± 1.27||29.21 ± 1.91|
|DGP Kampffmeyer et al. (2019)||39.58 ± 1.26||49.55 ± 1.60||27.38 ± 0.52||31.29 ± 1.31|
|ZSL-KG-LSTM||68.95 ± 1.49||65.57 ± 1.30||44.66 ± 2.40||42.72 ± 1.95|
|ZSL-KG-Tr||52.37 ± 2.75||54.96 ± 1.46||34.46 ± 2.25||32.63 ± 2.07|
4.3 Object Classification
To assess ZSL-KG’s versatility, we also consider object classification, a computer vision task.
Datasets. We evaluate on the Animals with Attributes 2 (AWA2) dataset Xian et al. (2018). AWA2 contains images of animals with 40 classes in the train and validation sets, and 10 classes in the test set. Each class is annotated with 85 attributes, such as whether it has stripes, lives in water, etc.
Experiment. Following prior work Kampffmeyer et al. (2019); Wang et al. (2018), here we use the L2 loss architecture for zero-shot learning. The example encoder and seen class representations come from the ResNet 50 model He et al. (2016) in Torchvision Marcel and Rodriguez (2010) pretrained on ILSVRC 2012 Russakovsky et al. (2015). We map the ILSVRC 2012 training and validation classes, and the AWA2 test classes, to ConceptNet. The model is trained on 950 random classes and the remaining 50 ILSVRC 2012 classes are used for validation. We use the same setting for SGCN and DGP using the authors’ implementation. The model with the least loss on the validation classes is used to make predictions on the test classes. Again following prior work, we predict on the 10 test classes of the updated split Xian et al. (2018) and report averaged per-class accuracy.
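Averaged per-class accuracy differs from plain accuracy when classes are imbalanced. The sketch below shows the computation on hypothetical labels; the function name and the toy predictions are illustrative only.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Compute accuracy within each true class, then average over classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical labels: four zebra images (all correct) and one seal image (wrong).
y_true = ["zebra", "zebra", "zebra", "zebra", "seal"]
y_pred = ["zebra", "zebra", "zebra", "zebra", "walrus"]
score = per_class_accuracy(y_true, y_pred)
```

Here plain accuracy would be 0.8, while the per-class average is 0.5, because the single misclassified seal counts as much as all four zebras together.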
|Method||Accuracy|
|GCNZ Wang et al. (2018)||70.7|
|SGCN Kampffmeyer et al. (2019)||74.24 ± 1.67|
|DGP Kampffmeyer et al. (2019)||74.30 ± 1.10|
Results. Table 3 shows the results for zero-shot object classification. ZSL-KG-Tr outperforms all other existing methods which learn class representations with graph neural networks. While ZSL-KG-LSTM performs relatively poorly on the task, it was clear from the validation loss during training that ZSL-KG-Tr should be preferred on this task. ZSL-KG sets a new state of the art among methods which do not use hand-engineered class attributes, and is 1 point away from the highest reported score among all methods Verma et al. (2020).
4.4 Comparison of Graph Aggregators
We conduct an ablation study with different aggregators within our framework. Existing graph neural networks include GCN Kipf and Welling (2017), GAT Veličković et al. (2018), and RGCN Schlichtkrull et al. (2018). GCN computes the mean of the node neighbours to learn the graph structure. GAT computes the edge attention and applies it to the node features to learn the structure of the graph. RGCN conditions the neighbourhood features with a learnable weight and then applies a weighted mean to learn the graph structure. We provide all the architectural details in Section C. We train these models with the same experimental setting for the tasks mentioned in their respective sections.
|Method||SNIPS-NLU Accuracy||Mic. Avg.||Mic. Avg.||AWA2 Accuracy|
|GCN Kipf and Welling (2017)||84.78 ± 0.77||52.71 ± 2.64||33.11 ± 1.68||73.68 ± 0.63|
|GAT Veličković et al. (2018)||87.57 ± 1.59||56.71 ± 1.07||31.12 ± 1.97||74.70 ± 0.78|
|RGCN Schlichtkrull et al. (2018)||87.47 ± 1.81||65.60 ± 1.90||35.67 ± 5.55||65.10 ± 1.01|
|ZSL-KG-LSTM||88.81 ± 1.17||68.95 ± 1.49||44.66 ± 2.40||65.22 ± 1.03|
|ZSL-KG-Tr||88.98 ± 1.22||52.37 ± 2.75||34.46 ± 2.25||76.50 ± 0.67|
Results. Table 4 shows results for our ablation study. Our results show that ZSL-KG outperforms existing graph neural networks with linear aggregators. Relational aggregators do not outperform non-linear aggregators and may reduce the overall performance (as seen in AWA2). Finally, our results on intent classification suggest that common sense knowledge graphs with even simple linear aggregators can outperform existing state-of-the-art benchmark results Liu et al. (2019).
5 Related Works
We broadly describe the related works for zero-shot learning and graph neural networks.
Zero-Shot Learning. Zero-shot learning has been thoroughly researched in the computer vision community for object classification Akata et al. (2015); Farhadi et al. (2009); Frome et al. (2013); Lampert et al. (2014); Wang et al. (2019); Xian et al. (2018). Recent works in zero-shot learning have used graph neural networks for object classification Kampffmeyer et al. (2019); Wang et al. (2018). In our work, we extend their approach to general-purpose common sense knowledge graphs to generate class representations. Other notable works use generative methods for generalized zero-shot learning, where both seen and unseen classes are evaluated at test time Kumar Verma et al. (2018); Schonfeld et al. (2019). However, these methods still rely on hand-crafted attributes for classification. Zero-shot learning for text classification is a well-studied problem Dauphin et al. (2013); Nam et al. (2016); Pappas and Henderson (2019); Zhang et al. (2019). Previously, ConceptNet has been used for transductive zero-shot text classification as a shallow source of knowledge for class representation Zhang et al. (2019). They use ConceptNet to generate a sparse vector, which is combined with pretrained embeddings and a description to obtain the class representation. In contrast, we use ConceptNet to generate dense vector representations from a graph neural network and use them as our class representations. Another line of work treats zero-shot text classification as a textual entailment problem and benchmarks performance on several large-scale text classification datasets Yin et al. (2019). Fine-grained entity typing Choi et al. (2018); Gillick et al. (2014); Ling and Weld (2012); Shimaoka et al. (2017); Yogatama et al. (2015) has been thoroughly researched for several years. However, few approaches tackle zero-shot fine-grained entity typing due to the lack of formalism in the task. Existing methods use different datasets and vary the evaluation metric. Most recent works in zero-shot learning for fine-grained entity typing have used a bilinear similarity model with a different class representation Ma et al. (2016); Obeidat et al. (2019); Yuan and Downey (2018).
Graph Neural Networks. Recent works on graph neural networks have demonstrated significant improvements for several downstream tasks such as node classification and graph classification Hamilton et al. (2017a, b); Kipf and Welling (2017); Veličković et al. (2018); Wu et al. (2019). Extensions of graph neural networks to relational graphs have produced significant results in several graph-related tasks Marcheggiani and Titov (2017); Schlichtkrull et al. (2018); Shang et al. (2019); Vashishth et al. (2020). Training on large graphs is a challenging task and existing works in the field have explored sampling techniques to achieve the same performance with a subsampled graph Chen et al. (2018b, a); Zou et al. (2019); Ying et al. (2018). We use a random-walk-based approach to train our graph neural network. Furthermore, several diverse applications using graph neural networks have been explored: common sense reasoning Lin et al. (2019), fine-grained entity typing Xiong et al. (2019), text classification Yao et al. (2019), and others. For a more in-depth review, we point readers to Wu et al. (2020).
In conclusion, we present ZSL-KG, a general-purpose framework based on graph neural networks to learn class representations from common sense knowledge graphs. We show that our framework outperforms existing state-of-the-art benchmarks in two challenging zero-shot learning tasks in language. Furthermore, we demonstrate the versatility of our framework by beating the best methods that do not require hand-engineered class representations. Finally, our study comparing our framework with other linear graph neural networks shows that non-linear graph aggregators perform significantly better than linear aggregators.
Deep neural networks require huge amounts of labeled training data to achieve strong performance. Labeling data is an expensive process as it requires humans to manually annotate the datasets. Our work on zero-shot learning aims to reduce the cost of labeling datasets. Our framework can be integrated into existing weak supervision pipelines such as Snorkel Ratner et al. (2020), where ZSL-KG can be used as a source of supervision along with other labeling functions. Our framework can also be used to provide initial labels for active learning pipelines and can be continually improved with feedback.
Although reducing the cost of labeling data can appear to be a positive effect, many human annotators rely on labeling data as a source of income. Automating the labeling pipelines has a negative impact on their livelihood. However, jobs in the past have been automated by technology. For instance, ATMs automated the job of a cashier to a great extent. We suspect that the role of human annotators could be a temporary phase during this machine learning boom and will eventually cease to exist. Furthermore, historical evidence suggests that technological automation in the past has not led to long-term effects on unemployment rates Pissarides (2019).
Our system can fail in two ways: (1) the zero-shot class is not a node in the graph, or (2) the zero-shot prediction is incorrect. In the first case, our model simply cannot predict, and new nodes or edges need to be added to the graph for the model to work. The second case, where our model’s predictions are incorrect, presents a greater ethical concern. Our current system does not offer any explanation when the model predicts an incorrect label for an example. A possible direction for future work is to obtain explanations for the predictions and rely on humans to make informed decisions in critical applications Lai and Tan (2019); Lipton (2018); Ribeiro et al. (2016).
There is a growing interest in the community in representing 'implicit' common sense knowledge as graphs. Our work hinges on the correctness of the knowledge graph. Common sense knowledge graphs are usually constructed by aggregating existing knowledge graphs and through crowdsourcing. This process can introduce offensive terms and associations into the knowledge graph, which can affect the downstream applications. For example, in Figure 1, we can see that politician has an edge to statesman. This could introduce gender-related bias into our fine-grained entity typing model. Curating the knowledge graph by filtering offensive and biased nodes is a possible solution. For instance, ConceptNet 5.8 filtered edges from its graph using metadata from Wiktionary to reduce offensive terms. Its authors also report no significant drop in performance on semantic benchmarks, as these edges were not valuable.
We thank Yang Zhang for help preparing the Ontonotes dataset. We thank Roma Patel and Elaheh Raisi for providing helpful feedback on our work. This material is based on research sponsored by Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL) under agreement number FA8750-19-2-1006. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL) or the U.S. Government. We gratefully acknowledge support from Google. Disclosure: Stephen Bach is an advisor to Snorkel AI, a company that provides software and services for weakly supervised machine learning.
- Learning dynamic knowledge graphs to generalize on text-based games. arXiv preprint arXiv:2002.09127. Cited by: §5.
- Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). Cited by: §1, §5.
- Graph convolutional encoders for syntax-aware neural machine translation. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §5.
- Stochastic training of graph convolutional networks with variance reduction. International Conference on Machine Learning (ICML). Cited by: §5.
- Fastgcn: fast learning with graph convolutional networks via importance sampling. International Conference on Learning Representations (ICLR). Cited by: §3, §5.
- Ultra-fine entity typing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Cited by: §5.
- Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. ICML workshop on Privacy in Machine Learning and Artificial Intelligence. Cited by: item 3, §4.1.
- Zero-shot learning for semantic utterance classification. arXiv preprint arXiv:1401.0509. Cited by: §5.
- A joint model for entity analysis: coreference, typing, and linking. Transactions of the association for computational linguistics (TACL). Cited by: §4.2.
- Describing objects by their attributes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §5.
- Devise: a deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.1, §4.1, Table 1, §5.
- Context-dependent fine-grained entity type tagging. arXiv preprint arXiv:1412.1820. Cited by: item 3, §4.2, §5.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §3, §5.
- Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin. Cited by: §2.2, §5.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3.
- Long short-term memory. Neural computation. Cited by: §1, §3.
- Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §1, §2.1, §2.2, §4.3, Table 1, Table 2, Table 3, §4, §5.
- Adam: a method for stochastic optimization. International Conference on Learning Representations (ICLR). Cited by: §D.1.
- Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §4.4, Table 4, §5.
- Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
- Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
- On human predictions with explanations and predictions of machine learning models: a case study on deception detection. In Proceedings of FAT*, Cited by: Broader Impact.
- Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). Cited by: §1, §5.
- Kagnet: knowledge-aware graph networks for commonsense reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Cited by: §5.
- Fine-grained entity recognition. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §B.3, item 3, §4.2, §5.
- The mythos of model interpretability. ACM Queue. Cited by: Broader Impact.
- Reconstructing capsule networks for zero-shot intent classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §4.1, §4.4, Table 1.
- ConceptNet—a practical commonsense reasoning tool-kit. BT technology journal. Cited by: §1.
- Label embedding for zero-shot fine-grained named entity typing. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Cited by: §5.
- Torchvision: The machine-vision package of torch. In International Conference on Multimedia, Cited by: §4.3.
- Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §3, §5.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
- Janossy pooling: learning deep permutation-invariant functions for variable-size inputs. International Conference on Learning Representations (ICLR). Cited by: §1, §3.
- All-in text: learning document, label, and word representations jointly. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §5.
- Description-based zero-shot fine-grained entity typing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Cited by: §2.1, §4.2, Table 2, §5.
- GILE: a generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics. Cited by: §5.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Cited by: §1, §2.2, §4.
- Over half of the world’s jobs are replaceable by a robot - is yours?. Cited by: Broader Impact.
- Snorkel: rapid training data creation with weak supervision. The VLDB Journal. Cited by: Broader Impact.
- " Why should i trust you?" explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, Cited by: Broader Impact.
- An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning (ICML), Cited by: §1.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV). Cited by: §4.3.
- Dynamic routing between capsules. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.1.
- Modeling relational data with graph convolutional networks. In European Semantic Web Conference, Cited by: §3, §4.4, Table 4, §5.
- Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
- End-to-end structure-aware convolutional networks for knowledge base completion. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §5.
- Neural architectures for fine-grained entity type classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (EACL). Cited by: §4.2, §5.
- Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.1, §2.1.
- Conceptnet 5.5: an open multilingual graph of general knowledge. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1, §3, §3, §4.
- Webchild 2.0: fine-grained commonsense knowledge distillation. In Proceedings of ACL 2017, System Demonstrations, Cited by: §1.
- Composition-based multi-relational graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §5.
- Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §1, §3.
- Graph attention networks. International Conference on Learning Representations (ICLR). Cited by: §4.4, Table 4, §5.
- A meta-learning framework for generalized zero-shot learning. AAAI Conference on Artificial Intelligence (AAAI). Cited by: §4.3.
- A survey of zero-shot learning: settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology (TIST). Cited by: §1, §2.1, §5.
- Zero-shot recognition via semantic embeddings and knowledge graphs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §1, §1, §2.1, §2.2, §4.3, Table 1, Table 2, Table 3, §4, §5.
- Simplifying graph convolutional networks. International Conference on Machine Learning (ICML). Cited by: §5.
- A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §5.
- Zero-shot user intent detection via capsule neural networks. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Cited by: §4.1, Table 1.
- Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1.
- Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). Cited by: item 3, §2.1, §2.1, §4.3, §4.3, §5.
- Imposing label-relational inductive bias for extremely fine-grained entity typing. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (NAACL). Cited by: §5.
- How powerful are graph neural networks?. In International Conference on Learning Representations (ICLR), Cited by: §2.2.
- Corpus-level fine-grained entity typing using contextual information. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: §4.2.
- Graph convolutional networks for text classification. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §5.
- Improving semantic parsing via answer type inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §4.2.
- Benchmarking zero-shot text classification: datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §5.
- Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Cited by: §3, §5.
- Embedding methods for fine grained entity type classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Cited by: §2.1, §5.
- Otyper: a neural architecture for open named entity typing. In AAAI Conference on Artificial Intelligence (AAAI), Cited by: §A.2, §4.2, §4.2, Table 2, §5.
- TransOMCS: from linguistic graphs to commonsense knowledge. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). Cited by: §1, §3.
- Integrating semantic knowledge to tackle zero-shot text classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §5.
- Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
- Few-shot representation learning for out-of-vocabulary words. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.
Appendix A Dataset Details
Here, we provide additional details and examples from the datasets in our experiments.
A.1 Intent Classification
|Utterance||Intent|
|Can you please look up the game the islanders ?||Search|
|Is it going to get chillier at 5 am in trussville bosnia and herzegovina ?||Weather|
|Add song to my club hits||Playlist|
In our intent classification experiments, we benchmark our results on the SNIPS-NLU dataset. Table 5 shows samples from the SNIPS-NLU dataset.
A.2 Fine-grained entity typing
Table 6 shows examples from the fine-grained entity typing datasets. The FIGER dataset has 113 classes. We pick 41 classes and divide them into 10 folds of train, development and test sets Yuan and Downey. During training, we remove from each example the classes that are not included in the training split for the fold. Similarly, OntoNotes (after the postprocessing described in Section B.2) has 86 classes. We pick 35 classes and divide them into 7 folds of train, development and test sets.
Appendix B Fine-grained Entity Typing Details
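We describe AttentiveNER, which we use to represent the example in the fine-grained entity typing task. Each mention $m$ comprises $n$ tokens, which are individually mapped to their respective pretrained word embeddings. We average these embeddings to obtain a single vector $v_m$. Formally, we can define the vector $v_m$ as:
$$v_m = \frac{1}{n} \sum_{j=1}^{n} u(m_j)$$
where $m_j$ is the $j$-th token of the mention and $u(m_j)$ is its pretrained word embedding. We also need to learn the context of the mention. The left context is represented by $l_1, \dots, l_C$ and the right context by $r_1, \dots, r_C$, where $l_i$ and $r_i$ are the word embeddings for the left and the right context respectively and $C$ is the window size for the context. We pass $l_1, \dots, l_C$ and $r_1, \dots, r_C$ to the BiLSTM separately to obtain the hidden vectors $\overrightarrow{h^l_i}$, $\overleftarrow{h^l_i}$ for the left context and $\overrightarrow{h^r_i}$, $\overleftarrow{h^r_i}$ for the right context.
We then pass the hidden vectors to the attention layer. The attention layer is a 2-layer feedforward neural network that computes a scalar attention value for each of the hidden vectors. The attention values are normalized and used to compute the weighted sum of the hidden vectors to obtain the context vector $v_c$. Formally, we describe the operations below:
$$\alpha^l_i = W_\alpha \tanh\left(W_e \left[\overrightarrow{h^l_i}; \overleftarrow{h^l_i}\right]\right), \qquad \alpha^r_i = W_\alpha \tanh\left(W_e \left[\overrightarrow{h^r_i}; \overleftarrow{h^r_i}\right]\right)$$
We normalize the scalar attention values:
$$a^l_i = \frac{\exp(\alpha^l_i)}{\sum_{j=1}^{C} \exp(\alpha^l_j) + \sum_{j=1}^{C} \exp(\alpha^r_j)}, \qquad a^r_i = \frac{\exp(\alpha^r_i)}{\sum_{j=1}^{C} \exp(\alpha^l_j) + \sum_{j=1}^{C} \exp(\alpha^r_j)}$$
The scalar values are multiplied with their respective hidden vectors to get the final context vector representation $v_c$:
$$v_c = \sum_{i=1}^{C} a^l_i \left[\overrightarrow{h^l_i}; \overleftarrow{h^l_i}\right] + \sum_{i=1}^{C} a^r_i \left[\overrightarrow{h^r_i}; \overleftarrow{h^r_i}\right]$$
Apart from the mention representation and context representation, we learn the hand-crafted feature representation $v_f$ from the mention's syntactic, word-form and topic features. Finally, we concatenate the context vector $v_c$, the mention vector $v_m$ and the feature vector $v_f$ to get the input representation $x$.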
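The normalization and weighted sum performed by the attention layer can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the scalar attention scores have already been produced by the feedforward network, that normalization is taken jointly over the left and right contexts, and the function name `context_vector` and the toy inputs are hypothetical.

```python
import math


def context_vector(left_hidden, left_scores, right_hidden, right_scores):
    """Softmax-normalize scalar scores over both contexts and return
    (context_vector, weights). Joint normalization is an assumption."""
    hidden = left_hidden + right_hidden
    scores = left_scores + right_scores
    # Subtract the maximum score for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    dim = len(hidden[0])
    # Weighted sum of the hidden vectors, one coordinate at a time.
    context = [sum(w * h[k] for w, h in zip(weights, hidden)) for k in range(dim)]
    return context, weights


# Toy example: one hidden vector per side, equal scores.
ctx, w = context_vector([[1.0, 0.0]], [0.0], [[0.0, 1.0]], [0.0])
print(ctx, w)
```

With equal scores the weights are uniform, so the context vector is simply the mean of the hidden vectors; a larger score on one side shifts the context vector toward that side's hidden state.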
B.2 OntoNotes Setup
We observed several challenges with OntoNotes. OntoNotes has the class /other associated with the named entities, which makes the named entity ambiguous. For example, in "Ducks were the most precious asset in our village." and "A kitchen knife ; a knife from your kitchen at home.", the entities Ducks and knife are labeled with /other. Furthermore, it is not clear which is the correct concept for the /other label, as it is an all-encompassing word. To solve this issue, we remove /other labels from the dataset and rename all its subclasses without the /other prefix. We choose 35 classes from the list of 86 labels as the test classes. Like FIGER, the 35 classes are split into 7 folds of test for zero-shot learning.
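The relabeling step can be illustrated with a small sketch. It assumes labels are slash-separated strings such as `/other/food`; the function name `strip_other` and the example labels are hypothetical and serve only to show the transformation described above.

```python
def strip_other(labels):
    """Drop the bare '/other' label and remove the '/other' prefix
    from its subclasses, preserving the original label order."""
    cleaned = []
    for label in labels:
        if label == "/other":
            continue  # the catch-all class itself is removed
        if label.startswith("/other/"):
            label = label[len("/other"):]  # '/other/food' becomes '/food'
        if label not in cleaned:
            cleaned.append(label)
    return cleaned


example = ["/other", "/other/food", "/person", "/other/art/music"]
print(strip_other(example))  # ['/food', '/person', '/art/music']
```

A mention whose only label is `/other` ends up with an empty label list; how such examples are handled afterwards is not specified here.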
B.3 Evaluation of fine-grained entity typing
Our fine-grained entity typing setup has multiple folds for OntoNotes and FIGER. Furthermore, the task is a multi-label classification problem. Building on existing research in fine-grained entity typing Ling and Weld, we modify the strict accuracy metric. We introduce micro average strict accuracy and macro average strict accuracy to evaluate the performance of our model across multiple folds.
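Micro average strict accuracy =
$$\frac{\sum_{k=1}^{K} \sum_{e \in E_k} \mathbb{1}(\hat{y}_e = y_e)}{\sum_{k=1}^{K} |E_k|}$$
and Macro average strict accuracy =
$$\frac{1}{K} \sum_{k=1}^{K} \frac{1}{|E_k|} \sum_{e \in E_k} \mathbb{1}(\hat{y}_e = y_e)$$
where $k = 1, \dots, K$ corresponds to the folds, $E_k$ are the examples in the test set with unseen classes for the $k$-th fold, $\hat{y}_e$ and $y_e$ are the predicted and gold label sets for example $e$, and $\mathbb{1}$ is the indicator function.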
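The difference between the two averages is easiest to see on a toy example. The sketch below assumes that strict accuracy counts an example as correct only when the predicted label set equals the gold label set exactly, that the micro average pools all examples across folds, and that the macro average is the mean of per-fold accuracies; the function names and the data are hypothetical.

```python
def strict_correct(predicted, gold):
    """An example is correct only if the label sets match exactly."""
    return set(predicted) == set(gold)


def micro_macro_strict_accuracy(folds):
    """folds: list of folds, each a list of (predicted_labels, gold_labels)."""
    total_correct = 0
    total_examples = 0
    per_fold = []
    for fold in folds:
        correct = sum(strict_correct(p, g) for p, g in fold)
        total_correct += correct
        total_examples += len(fold)
        per_fold.append(correct / len(fold))
    micro = total_correct / total_examples  # pooled over all examples
    macro = sum(per_fold) / len(per_fold)   # mean of per-fold accuracies
    return micro, macro


folds = [
    [
        (["/person"], ["/person"]),
        (["/person", "/person/artist"], ["/person", "/person/artist"]),
        (["/location"], ["/location"]),
        (["/person"], ["/person", "/person/artist"]),  # partial match is wrong
    ],
    [
        (["/organization"], ["/location"]),
    ],
]
print(micro_macro_strict_accuracy(folds))  # (0.6, 0.375)
```

The small second fold barely affects the micro average but counts as much as the first fold in the macro average, which is why the two numbers diverge when folds differ in size.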
Appendix C Graph Neural Networks Architecture Details
In our work, we compare the ZSL-KG framework with multiple graph neural network architectures, namely GCN, GAT and RGCN.
GCN uses a mean aggregator to learn the neighbourhood structure. GAT projects the neighbourhood nodes to new features. The neighbourhood node features are concatenated with the self feature and passed through a self-attention module to get the attention coefficients. The attention coefficients are multiplied with the neighbourhood features to get the node embedding for the $l$-th layer in the combine function. RGCN uses a relational aggregator to learn the structure of the neighbourhood. To avoid overparameterization from the relational weights, we perform basis decomposition of the weight vector into $B$ bases. We learn relational coefficients and basis weight vectors in the aggregate function and add the result to the self feature in the combine function.
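Basis decomposition can be sketched as follows: each relation's weight matrix is a linear combination of a small set of shared basis matrices, so only the per-relation coefficients grow with the number of relations. This is an illustrative sketch in plain Python; the function name `relation_weight`, the relation names, and the toy values are assumptions, not the configuration used in the experiments.

```python
def relation_weight(coefficients, bases):
    """Combine shared basis matrices into one relation-specific matrix:
    W_r = sum_b coefficients[b] * bases[b]."""
    rows, cols = len(bases[0]), len(bases[0][0])
    weight = [[0.0] * cols for _ in range(rows)]
    for a, basis in zip(coefficients, bases):
        for i in range(rows):
            for j in range(cols):
                weight[i][j] += a * basis[i][j]
    return weight


# Two shared 2x2 bases and per-relation coefficients (toy values).
bases = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
coefficients = {"IsA": [2.0, 0.5], "RelatedTo": [1.0, 0.0]}

for relation, coeffs in coefficients.items():
    print(relation, relation_weight(coeffs, bases))
```

With a single basis, every relation shares the same matrix up to a scalar, which keeps the parameter count close to that of a non-relational aggregator.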
Appendix D Hyperparameters
In this section, we detail the hyperparameters used in our experiments.
D.1 Training Details
Our framework is built using PyTorch and AllenNLP (https://allennlp.org/). In all our experiments, we use Adam Kingma and Ba to train our parameters with a learning rate of 0.001. For intent classification, we experiment with a weight decay of 1e-05 and 5e-05. We found that a weight decay of 5e-05 gives the best overall performance in intent classification for all the baseline graph aggregators. In intent classification, ZSL-KG-LSTM and ZSL-KG-Tr use a weight decay of 5e-05 and 1e-05 respectively. We use a weight decay of 1e-05 for the OntoNotes experiments, as OntoNotes is a smaller dataset than FIGER. Finally, all experiments in zero-shot object classification use a weight decay of 5e-04.
The language tasks jointly train the parameters of the example encoder in our experiments, whereas the object classification experiments use a pretrained ResNet50 as the example encoder. The language tasks use a biLSTM with an attention-based architecture as the example encoder. Furthermore, we assume that the compatibility matrix is low-rank: the matrix is factorized into two smaller matrices whose shared inner dimension is the low-rank dimension. Table 8 summarizes the hyperparameters used in the example encoders. Additionally, AttentiveNER has two other hyperparameters: the mention representation and the hand-crafted feature representation. The mention representation uses pretrained 300-dim GloVe embeddings. The feature representation is a 60-dim vector learned during training. Lastly, in both intent classification and fine-grained entity typing, out-of-vocabulary (OOV) words, i.e., words that do not have embeddings in GloVe, are initialized randomly.
|Task||Inp. dim.||Hidden dim.||Attn. dim.||Low-rank dim.|
|Fine-grained entity typing||300||100||100||20|
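The low-rank assumption on the compatibility matrix can be illustrated with a short sketch. It assumes the compatibility score has the bilinear form $x^\top W z$ with $W = A^\top B$, where $x$ is the example representation and $z$ is the class representation; this factorized form, the names `A`, `B`, `low_rank_score`, and the toy dimensions are assumptions for illustration and are not taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def low_rank_score(A, B, x, z):
    """Score x^T (A^T B) z computed as (A x) . (B z), so the full
    compatibility matrix is never materialized."""
    projected_x = matvec(A, x)  # length h (the low-rank dimension)
    projected_z = matvec(B, z)  # length h
    return sum(a * b for a, b in zip(projected_x, projected_z))


# Toy example: 3-dim example vector, 2-dim class vector, rank h = 2.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # h x d_x
B = [[1.0, 0.0], [0.0, 1.0]]            # h x d_z
print(low_rank_score(A, B, [2.0, 3.0, 4.0], [5.0, 6.0]))  # 28.0
```

Under this factorization the number of parameters scales with h times the sum of the two input dimensions rather than with their product, which is the usual motivation for a small low-rank dimension.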
In fine-grained entity typing, we have two baselines that do not use graph neural networks: OTyper and DZET. OTyper averages 300-dim GloVe embeddings to obtain the class representations. DZET uses a biLSTM with attention to learn the class representations. The architecture is the same as the biLSTM used to encode the context of the mention, except that it has no window size. The input words are represented with 300-dim GloVe embeddings, which are passed to the biLSTM, which has a hidden state dimension of 100. The hidden states from both directions are concatenated to get a 200-dim hidden state vector, which is passed to an attention module. The attention module is a multilayer perceptron which learns two weight matrices of the dimension, and . The hidden states are multiplied with the scalar attention values to get the class representation.
D.2 Graph Aggregator Summary
|Task||Layer 1 dim.||Layer 2 dim.|
|Fine-grained entity typing||128||128|
Table 9 describes the output dimensions of the node embeddings after each graph neural network layer. GCN, DGP, GCNZ, and SGN are linear aggregators and learn only one weight matrix in each of the layers. GAT learns a weight matrix for the attention and uses a LeakyReLU activation with a negative slope of 0.2 in the attention. RGCN learns bases weight vectors in the baseline. We found that performs the best for fine-grained entity typing and object classification. For intent classification, we use 10 bases. In intent classification and fine-grained entity typing, the non-linear activation function after the graph neural network layer is ReLU, and in object classification the activation function is LeakyReLU with a negative slope of 0.2.
ZSL-KG-LSTM learns weight matrices inside the aggregator. In the LSTM, the dimension of the hidden state is the same as the input dimension. ZSL-KG-Tr is a complex architecture with numerous parameters. In our transformer module, there are five hyperparameters: input dimension, output dimension, feedforward layer hidden dimension, projection dimension, and the number of transformer layers. We use only one transformer layer in all experiments. The input and output dimensions are the same in the aggregator. For instance, in intent classification, the first-layer feature for a node is a 300-dim vector and the output is also a 300-dim vector. For the projection dimension and the feedforward hidden layer dimension, we use half the input dimension. For instance, in intent classification, the first-layer projection dimension and the feedforward hidden layer dimension are both 150, i.e., half of 300.
We simulate random walks for each node in the ConceptNet graphs and compute the hit probability for the nodes in the neighbourhood. The number of steps in the random walk is 20 and the number of restarts is 10. Finally, we apply add-one smoothing to the visit counts and normalize the counts for the neighbouring nodes.
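One possible reading of this procedure is sketched below. It assumes that each restart begins a fresh walk at the source node, that walks step uniformly at random to an adjacent node, and that the smoothed counts are normalized over the source node's direct neighbours; the function name `neighbour_hit_probabilities`, the seeded generator, and the toy graph are hypothetical and not taken from the authors' code.

```python
import random
from collections import Counter


def neighbour_hit_probabilities(graph, start, steps=20, restarts=10, seed=0):
    """Estimate how often walks from `start` hit each direct neighbour.

    graph: dict mapping a node to a list of adjacent nodes.
    Returns a dict of neighbour -> probability (sums to 1).
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(restarts):
        current = start  # every restart begins again at the source node
        for _ in range(steps):
            neighbours = graph[current]
            if not neighbours:
                break  # dead end: stop this walk early
            current = rng.choice(neighbours)
            visits[current] += 1
    # Add-one smoothing so unvisited neighbours keep a non-zero weight.
    smoothed = {n: visits[n] + 1 for n in graph[start]}
    total = sum(smoothed.values())
    return {n: count / total for n, count in smoothed.items()}


toy_graph = {
    "politician": ["statesman", "person"],
    "statesman": ["politician"],
    "person": ["politician"],
}
print(neighbour_hit_probabilities(toy_graph, "politician"))
```

The smoothing step matters for sparsely connected neighbours: without it, a neighbour that no walk happened to reach would receive probability zero.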