Contains code for the EMNLP paper `Learning Linguistic Attributes for Zero-Shot Verb Classification'
In this paper, we investigate large-scale zero-shot activity recognition by modeling the visual and linguistic attributes of action verbs. For example, the verb "salute" has several properties, such as being a light movement, a social act, and short in duration. We use these attributes as the internal mapping between visual and textual representations to reason about a previously unseen action. In contrast to much prior work that assumes access to gold standard attributes for zero-shot classes and focuses primarily on object attributes, our model uniquely learns to infer action attributes from dictionary definitions and distributed word representations. Experimental results confirm that action attributes inferred from language can provide a predictive signal for zero-shot prediction of previously unseen activities.
We study the problem of inferring action verb attributes based on how the word is defined and used in context. For example, given a verb such as “swig” shown in Figure 1, we want to infer various properties of actions such as motion dynamics (moderate movement), social dynamics (solitary act), body parts involved (face, arms, hands), and duration (less than 1 minute) that are generally true for the range of actions that can be denoted by the verb “swig.”
Our ultimate goal is to improve zero-shot learning of activities in computer vision: predicting a previously unseen activity by integrating background knowledge about the conceptual properties of actions. For example, a computer vision system may have seen images of “drink” activities during training, but not “swig.” Ideally, the system should infer the likely visual characteristics of “swig” using world knowledge implicitly available in dictionary definitions and word embeddings.
However, most existing literature on zero-shot learning has focused on object recognition, with only a few notable exceptions (see Related Work in Section 8). There are two critical reasons: object attributes, such as color, shape, and texture, are conceptually straightforward to enumerate. In addition, they have distinct visual patterns which are robust for current vision systems to recognize. In contrast, activity attributes are more difficult to conceptualize as they involve varying levels of abstractness, which are also more challenging for computer vision as they have less distinct visual patterns. Noting this difficulty, Antol et al. (2014) instead employ cartoon illustrations as intermediate mappings for zero-shot dyadic activity recognition. We present a complementary approach: that of tackling the abstractness of verb attributes directly. We develop and use a corpus of verb attributes, using linguistic theories on verb semantics (e.g., aspectual verb classes of Vendler (1957)) and also drawing inspiration from studies on linguistic categorization of verbs and their properties (Friedrich and Palmer, 2014; Siegel and McKeown, 2000).
In sum, we present the first study aiming to recover general action attributes for a diverse collection of verbs, and probe their predictive power for zero-shot activity recognition on the recently introduced imSitu dataset (Yatskar et al., 2016). Empirical results show that action attributes inferred from language can help classify previously unseen activities and suggest several avenues for future research on this challenging task. We publicly share our dataset and code for future research (available at http://github.com/uwnlp/verb-attributes).
We consider seven different groups of action verb attributes. They are motivated in part by potential relevance for visual zero-shot inference, and in part by classical literature on linguistic theories on verb semantics. The attribute groups are summarized below (the full list is available in the supplemental section). Each attribute group consists of a set of distinct attributes annotated over the verbs. Several of our attributes are categorical; if converted to binary attributes, we would have 40 attributes in total.
We include the aspectual verb classes of Vendler (1957):
state: a verb that does not describe a changing situation (e.g. “have”, “be”)
achievement: a verb that can be completed in a short period of time (e.g. “open”, “jump”)
accomplishment: a verb with a sense of completion over a longer period of time (e.g. “climb”)
activity: a verb without a clear sense of completion (e.g. “swim”, “walk”, “talk”)
This attribute group relates to the aspectual classes above, but provides additional estimation of typical time duration with four categories. We categorize verbs by best-matching temporal units: seconds, minutes, hours, or days, with an additional option for verbs with unclear duration (e.g., “provide”).
This attribute group focuses on the energy level of motion dynamics in four categories: no motion (“sleep”), low motion (“smile”), medium (“walk”), or high (“run”). We add an additional option for verbs whose motion level depends highly on context, such as “finish.”
This attribute group focuses on the likely social dynamics, in particular, whether the action is usually performed as a solitary act, a social act, or either. This is graded on a 5-part scale from least social, through either, to most social.
This attribute group focuses on whether the verb can take an object, or be used without. This gives the model a sense of the implied action dynamics of the verb between the agent and the world. We record three variables: whether or not the verb is naturally transitive on a person (“I hug her” is natural), on a thing (“I eat it”), and whether the verb is intransitive (“I run”). It should be noted that a small minority of verbs do not allow a person as an agent (“snow”).
This attribute group focuses on the effects of actions on agents and other arguments. For each of the possible transitivities of the verb, we annotate whether or not it involves location change (“travel”), world change (“spill”), agent or object change (“cry”), or no visible change (“ponder”).
This attribute group specifies prominent body parts involved in carrying out the action. For example, “open” typically involves “hands” and “arms” when used in a physical sense. We use five categories: head, arms, torso, legs, and other body parts.
In general, contextual variations of action attributes are common, especially for frequently used verbs that describe everyday physical activities. For example, while “open” typically involves “hands”, there are exceptions, e.g. “open one’s eyes.” In this work, we focus on stereotypical or prominent characteristics across a range of actions that can be denoted using the same verb. Thus, the three investigation points of our work are: (1) crowd-sourcing experiments to estimate the distribution of human judgments on the prominent characteristics of everyday physical action verbs, (2) the feasibility of learning models for inferring the prominent characteristics of everyday action verbs despite the potential noise in the human annotation, and (3) their predictive power in zero-shot action recognition despite the potential noise from contextual variations of action attributes. As we will see in Section 7, our study confirms the usefulness of studying action attributes and motivates future study in this direction.
The key idea in our work, that action verbs project certain expectations about their influence on their arguments, their pre- and post-conditions, and their implications on social dynamics, relates to the original Frame theories of Baker et al. (1998a). The study of action verb attributes is also closely related to formal studies on verb categorization based on the characteristics of the actions or states that a verb typically associates with (Levin, 1993), and to cognitive linguistics literature that focuses on the causal structure and force dynamics of verb meanings (Croft, 2012).
In this section we present our models for learning verb attributes from language. We consider two complementary types of language-based input: dictionary definitions and word embeddings. The approach based on dictionary definitions resembles how people acquire the meaning of a new word from a dictionary lookup, while the approach based on word embeddings resembles how people acquire the meaning of words in context.
This task follows the standard supervised learning approach where the goal is to predict $K$ attributes per word in the vocabulary $V$. Let $x_v \in X$ represent the input representation of a word $v \in V$. For instance, $x_v$ could denote a word embedding, or a definition looked up from a dictionary (modeled as a list of tokens). Our goal is to produce a model $f : X \to \mathbb{R}^d$ that maps the input to a representation of dimension $d$. Modeling options include using pretrained word embeddings, as in Section 3.1, or using a sequential model to encode a dictionary definition, as in Section 3.2.
Then, the estimated probability distribution over attribute $k$ is given by:
$$\hat{y}_{v,k} = \sigma\left(W^{(k)} f(x_v)\right) \tag{1}$$
If the attribute is binary, then $W^{(k)}$ is a vector of dimension $d$ and $\sigma$ is the sigmoid function. Otherwise, $W^{(k)}$ is of shape $d_k \times d$, where $d_k$ is the dimension of attribute $k$, and $\sigma$ is the softmax function. Let the vocabulary $V$ be partitioned into sets $V_{train}$ and $V_{test}$; then, we train by minimizing the cross-entropy loss over $V_{train}$ and report attribute-level accuracy over words in $V_{test}$.
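The prediction head described above can be illustrated with a minimal pure-Python sketch. The function names (`attribute_distribution`, `cross_entropy`) and the toy weights are hypothetical and chosen only for illustration; the released code may organize this differently.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attribute_distribution(weights, features):
    """Distribution over the values of one attribute.

    weights is a flat list for a binary attribute (sigmoid head) or a
    list of rows, one per value, for a categorical attribute (softmax head).
    """
    if isinstance(weights[0], (int, float)):
        p = sigmoid(dot(weights, features))
        return [1.0 - p, p]
    return softmax([dot(row, features) for row in weights])

def cross_entropy(dist, gold_index):
    return -math.log(dist[gold_index])

# Toy 3-dimensional word representation (illustrative values only).
features = [0.5, -1.0, 2.0]
binary_w = [0.1, 0.2, 0.3]
categorical_w = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, 0.1, 0.0]]

p_binary = attribute_distribution(binary_w, features)
p_categorical = attribute_distribution(categorical_w, features)
loss = cross_entropy(p_categorical, 0)
```

Training would sum `cross_entropy` over all attributes and all training verbs; accuracy on held-out verbs compares the argmax of each distribution with the annotated value.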
Our task of learning verb attributes used in activity recognition is related to the task of learning object attributes, but with several key differences. Al-Halah et al. (2016) build the “Class-Attribute Association Prediction” model (CAAP) that classifies the attributes of an object class from its name. They apply it on the Animals with Attributes dataset, which contains 50 animal classes, each described by 85 attributes (Lampert et al., 2014). Importantly, these attributes are concrete details with semantically meaningful names such as “has horns” and “is furry.” The CAAP model takes advantage of this, consisting of a tensor factorization model initialized by the word embeddings of the object class names as well as the attribute names. On the other hand, verb attributes such as the ones we outline in Section 2 are technical linguistic terms. Since word embeddings principally capture common word senses, they are unsuited for representing the names of verb attributes. Thus, we evaluate two versions of CAAP as a baseline: CAAP-pretrained, where the model is preinitialized with GloVe embeddings for the attribute names (Pennington et al., 2014), and CAAP-learned, where the model is learned from random initialization.
One way of producing attributes is from distributed word embeddings such as word2vec (Mikolov et al., 2013). Intuitively, we expect similar verbs to have similar distributions of nearby nouns and adverbs, which can greatly help us in zero-shot prediction. In our experiments, we use 300-dimensional GloVe vectors trained on 840B tokens of web data (Pennington et al., 2014). We use logistic regression to predict each attribute, as we found that extra hidden layers did not improve performance. Thus, we let $f(x_v) = w_v$, the GloVe embedding of $v$, and use Equation 1 to get the distribution over labels.
We additionally experiment with retrofitted embeddings, in which embeddings are mapped in accordance with a lexical resource. Following the approach of Faruqui et al. (2015), we retrofit embeddings using WordNet (Miller, 1995), Paraphrase-DB (Ganitkevitch et al., 2013), and FrameNet (Baker et al., 1998b).
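As an illustration of retrofitting, the sketch below applies the iterative update of Faruqui et al. (2015), in which each vector is pulled toward the average of its lexicon neighbors while staying close to its original value. It assumes the commonly used weighting (weight 1 on the original vector, 1/degree on each neighbor) and 10 iterations; the function name `retrofit` and the toy lexicon are hypothetical.

```python
def retrofit(embeddings, neighbors, iterations=10):
    """Return retrofitted copies of the input vectors.

    embeddings: dict word -> list of floats (original vectors)
    neighbors:  dict word -> list of related words from a lexical resource
    """
    new = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, related in neighbors.items():
            related = [n for n in related if n in new]
            if word not in new or not related:
                continue
            beta = 1.0 / len(related)  # neighbor weights sum to 1
            dim = len(new[word])
            # Weighted mean of the neighbors and the original vector
            # (the total weight is 1 + 1 = 2).
            new[word] = [
                (sum(beta * new[n][i] for n in related) + embeddings[word][i]) / 2.0
                for i in range(dim)
            ]
    return new

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Toy lexicon: "sip" and "drink" are listed as related.
original = {"sip": [1.0, 0.0], "drink": [0.0, 1.0], "run": [1.0, 1.0]}
lexicon = {"sip": ["drink"], "drink": ["sip"]}
fitted = retrofit(original, lexicon)
```

Words without lexicon entries (here "run") keep their original vectors, while related words move toward each other.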
We additionally propose a model that learns the attribute-grounded meaning of verbs through dictionary definitions. This is similar in spirit to the task of using a dictionary to predict word embeddings (Hill et al., 2016).
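Our first model involves a Bidirectional Gated Recurrent Unit (BGRU) encoder (Cho et al., 2014). Let $x_{v,1:T}$ be a definition for verb $v$, with $T$ tokens. To encode the input, we pass it through the GRU equation:
$$\overrightarrow{h}_t = \operatorname{GRU}\left(x_{v,t}, \overrightarrow{h}_{t-1}\right)$$
Let $\overleftarrow{h}_1$ denote the output of a GRU applied on the reversed input $x_{v,T:1}$. Then, the BGRU encoder is the concatenation $f_{bgru}(x_{v,1:T}) = \overrightarrow{h}_T \,\|\, \overleftarrow{h}_1$.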
Additionally, we try two common flavors of a Bag-of-Words model. In the standard case, we first construct a vocabulary of 5000 words by frequency on the dictionary definitions. Then, $f_{bow}(x_v)$ represents the one-hot encoding: entry $j$ of $f_{bow}(x_v)$ indicates whether word $j$ appears in the definition $x_v$ for verb $v$.
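Additionally, we try out a Neural Bag-of-Words model where the GloVe embeddings in a definition are averaged (Iyyer et al., 2015). This is $f_{nbow}(x_{v,1:T}) = \frac{1}{T} \sum_{t=1}^{T} w_{x_{v,t}}$, where $x_{v,1:T}$ are the $T$ tokens of the definition and $w_{x_{v,t}}$ is the GloVe embedding of token $x_{v,t}$.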
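The two bag-of-words encoders can be sketched in a few lines of pure Python. The helper names and the toy definitions are hypothetical, and skipping tokens without an embedding in the averaged model is an assumption of this sketch rather than something the text specifies.

```python
from collections import Counter

def build_vocab(definitions, size):
    """Map the `size` most frequent tokens to column indices."""
    counts = Counter(tok for definition in definitions for tok in definition)
    return {word: i for i, (word, _) in enumerate(counts.most_common(size))}

def bow_encode(definition, vocab):
    """Binary indicator vector: 1.0 where a vocabulary word occurs."""
    vec = [0.0] * len(vocab)
    for tok in definition:
        if tok in vocab:
            vec[vocab[tok]] = 1.0
    return vec

def nbow_encode(definition, embeddings, dim):
    """Average the embeddings of the tokens that have one."""
    vecs = [embeddings[tok] for tok in definition if tok in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(column) / len(vecs) for column in zip(*vecs)]

definitions = [
    ["to", "drink", "quickly"],
    ["to", "move", "quickly"],
    ["to", "drink"],
]
vocab = build_vocab(definitions, size=3)  # the paper uses 5000 words
bow = bow_encode(["to", "move", "quickly"], vocab)

toy_embeddings = {"drink": [1.0, 0.0], "quickly": [0.0, 1.0]}
nbow = nbow_encode(["to", "drink", "quickly"], toy_embeddings, dim=2)
```

In the toy example the vocabulary keeps "to", "drink" and "quickly", so the indicator vector for "to move quickly" has two active entries, and the averaged vector for "to drink quickly" is the mean of the two available embeddings.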
One potential pitfall with using dictionary definitions is that there are often many definitions associated with each verb. This creates a dataset bias since polysemic verbs are seen more often. Additionally, dictionary definitions tend to be sorted by relevance, thus lowering the quality of the data if all definitions are weighted equally during training. To counteract this, we randomly oversample the definitions at training time so that each verb has the same number of definitions. (For the (neural) bag of words models, we also tried concatenating the definitions together per verb and then doing the encoding. However, we found that this gave worse results.) At test time, we use the first-occurring (and thus generally most relevant) definition per verb.
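A minimal sketch of this balancing step follows. It assumes that the common target is the largest number of definitions any verb has and that extra definitions are drawn uniformly with replacement; both choices, and the function names, are illustrative assumptions.

```python
import random

def oversample_definitions(defs_by_verb, seed=0):
    """Pad every verb's definition list to the same length by resampling."""
    rng = random.Random(seed)
    target = max(len(defs) for defs in defs_by_verb.values())
    balanced = {}
    for verb, defs in defs_by_verb.items():
        extra = [rng.choice(defs) for _ in range(target - len(defs))]
        balanced[verb] = list(defs) + extra
    return balanced

def first_definition(defs_by_verb):
    """Test-time input: the first-listed definition of each verb."""
    return {verb: defs[0] for verb, defs in defs_by_verb.items()}

defs_by_verb = {
    "swig": ["to drink in large gulps"],
    "open": ["to move from a closed position", "to unfasten", "to begin"],
}
train_defs = oversample_definitions(defs_by_verb)
test_defs = first_definition(defs_by_verb)
```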
We hypothesize that the two modalities of the dictionary and distributional embeddings are complementary. Therefore, we propose an early fusion (concatenation) of both categories. Figure 2 describes the GRU + embedding model; in other words, $f_{bgru+emb}(x_v) = f_{bgru}(x_v) \,\|\, f_{emb}(x_v)$, where $\|$ denotes concatenation. This can likewise be done with any choice of definition encoder and word embedding.
Given learned attributes for a collection of activities, we would like to evaluate their performance at describing these activities from real world images in a zero-shot setting. Thus, we consider several models that classify an image’s label by pivoting through an attribute representation.
A formal description of the task is as follows. Let the space of labels be $V$, partitioned into $V_{train}$ and $V_{test}$. Let $z_v \in Z$ represent an image with label $v \in V$; our goal is to correctly predict this label amongst verbs $v \in V_{test}$ at test time, despite never seeing any images with labels in $V_{test}$ during training.
Generalization will be done through a lookup table $A$, with known attributes for each $v \in V$. Formally, for each attribute $k$ we define $A^{(k)}$ as a matrix with one row per verb, whose entry $A^{(k)}_{v,i}$ indicates whether attribute $k$ for verb $v$ takes value $i$.
For binary attributes, we need only one entry per verb, making $A^{(k)}$ a single column vector. Let our image encoder be represented by the map $g : Z \to \mathbb{R}^d$. We then use the linear map in Equation 1 to produce the log-probability distribution over each attribute $k$. The distribution over labels is then:
$$P(\cdot \mid z) = \operatorname{softmax}_v\left(\sum_k A^{(k)} W^{(k)} g(z)\right) \tag{4}$$
where $W^{(k)}$ is a learned parameter that maps the image encoder to the attribute representation. We then train our model by minimizing the cross-entropy loss over the training verbs $V_{train}$.
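The attribute-pivot classifier can be sketched as follows: each verb's logit is the sum, over attributes, of the agreement between its row in the attribute table and the attribute scores computed from the image. The sketch assumes a +1/-1 indicator encoding for the table and uses hypothetical names and toy numbers throughout.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def label_distribution(attr_tables, attr_weights, image_features):
    """Distribution over verbs obtained by pivoting through attributes.

    attr_tables[k][v]  : indicator row of verb v over the values of attribute k
    attr_weights[k]    : matrix mapping image features to scores per value
    """
    num_verbs = len(attr_tables[0])
    logits = [0.0] * num_verbs
    for table, weights in zip(attr_tables, attr_weights):
        attr_scores = matvec(weights, image_features)
        for v, row in enumerate(table):
            logits[v] += sum(a * s for a, s in zip(row, attr_scores))
    return softmax(logits)

# Two candidate verbs and one two-valued attribute (assumed +1/-1 coding):
# verb 0 takes the first value, verb 1 takes the second.
tables = [[[1.0, -1.0], [-1.0, 1.0]]]
weights = [[[1.0, 0.0], [0.0, 1.0]]]
image = [2.0, 0.0]  # image evidence favors the attribute's first value
probs = label_distribution(tables, weights, image)
```

Because the table is a fixed lookup, swapping in the rows of unseen verbs at test time is all that is needed to score zero-shot classes.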
Our image encoder is a CNN with the ResNet-152 architecture (He et al., 2016). We use weights pretrained on ImageNet (Deng et al., 2009) and perform additional pretraining on imSitu using the training classes $V_{train}$. After this, we remove the top layer and set the image encoder $g$ to be the 2048-dimensional image representation from the network.
Our model is similar to those of Akata et al. (2013) and Romera-Paredes and Torr (2015) in that we predict the attributes indirectly and train the model through the class labels. (Unlike these models, however, we utilize (some) categorical attributes and optimize using cross-entropy.) It differs from several other zero-shot models, such as Lampert et al. (2014)’s Direct Attribute Prediction (DAP) model, in that DAP is trained by maximizing the probability of predicting each attribute and then multiplying the probabilities at test time. Our use of joint training allows the recognition model to directly optimize class discrimination rather than attribute-level accuracy.
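An additional method of doing zero-shot image classification is by using word embeddings directly. Frome et al. (2013) build DeVISE, a model for zero-shot learning on ImageNet object recognition where the objective is for the image model to predict a class’s word embedding directly. DeVISE is trained by minimizing the hinge ranking loss
$$\sum_{v' \neq v} \max\left\{0,\ \gamma - w_v^\top W^{emb} g(z) + w_{v'}^\top W^{emb} g(z)\right\}$$
for each image $z$ with label $v$, where the sum runs over the other training labels $v'$, $w_v$ is the word embedding of label $v$, $W^{emb}$ is a learned linear map, $g(z)$ is the image encoding, and $\gamma$ is a margin. We compare against a version of this model with fixed GloVe embeddings $w_v$.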
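A hinge ranking loss of the kind used by DeVISE can be sketched as below. The margin default of 0.1, the function name, and the two-dimensional toy embeddings are assumptions made for illustration; the image vector is taken to be already projected into the embedding space.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def devise_loss(true_label, label_embeddings, projected_image, margin=0.1):
    """Sum of hinge penalties over all labels other than the true one.

    A wrong label is penalized when its score comes within `margin`
    of the true label's score.
    """
    true_score = dot(label_embeddings[true_label], projected_image)
    loss = 0.0
    for label, embedding in label_embeddings.items():
        if label == true_label:
            continue
        loss += max(0.0, margin - true_score + dot(embedding, projected_image))
    return loss

label_embeddings = {"drink": [1.0, 0.0], "run": [0.0, 1.0]}
projected_image = [1.0, 0.0]  # lies exactly on the "drink" embedding

loss_correct = devise_loss("drink", label_embeddings, projected_image)
loss_wrong = devise_loss("run", label_embeddings, projected_image)
```

With these toy values the loss is zero when the image matches its label's embedding and positive when the image sits closer to a different label.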
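Additionally, we employ a variant of our model using only word embeddings. The equation is the same as Equation 4, except using the matrix $A^{emb}$ as a matrix of word embeddings: i.e., for each label $v$ we consider, we have $A^{emb}_v = w_v$.
To combine the representation power of the attribute and embedding models, we build an ensemble combining both models. This is done by adding the logits before the softmax is applied:
$$P(\cdot \mid z) = \operatorname{softmax}_v\left(\sum_k A^{(k)} W^{(k)} g(z) + A^{emb} W^{emb} g(z)\right)$$
where $g(z)$ is the image encoding and $W^{(k)}$ and $W^{emb}$ are the learned linear maps of the attribute and embedding components.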
A diagram is shown in Figure 3. We find that during optimization, this model can easily overfit, presumably by excessive co-adaptation of the embedding and attribute components. To address this, we train the model to minimize the cross entropy of three sources independently: the attributes only, the embeddings only, and the sum, weighting each equally.
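The three-term objective can be written compactly. In this sketch, "weighting each equally" is read as an unweighted sum of the three cross-entropy terms, which is an assumption; the function names and logits are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, gold_index):
    return -math.log(softmax(logits)[gold_index])

def ensemble_loss(attr_logits, emb_logits, gold_index):
    """Cross-entropy of attribute-only, embedding-only, and summed logits."""
    summed = [a + e for a, e in zip(attr_logits, emb_logits)]
    return sum(
        cross_entropy(logits, gold_index)
        for logits in (attr_logits, emb_logits, summed)
    )

attr_logits = [2.0, 0.0]  # attribute component favors class 0
emb_logits = [0.0, 1.0]   # embedding component favors class 1
loss = ensemble_loss(attr_logits, emb_logits, gold_index=0)
```

Because each component also receives its own loss term, neither can rely entirely on the other to fit the training labels.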
We additionally experiment with an ensemble of our model, combining predicted and gold attributes of the zero-shot (test) verbs. This allows the model to hedge against cases where a verb attribute might have several possible correct answers. A single model is trained; at test time, we multiply the class-level probabilities of each together to get the final predictions.
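The test-time combination amounts to an elementwise product of two class distributions. The renormalization in this sketch is an added convenience (it does not change the ranking of classes), and the function name and toy probabilities are hypothetical.

```python
def combine_predictions(p_predicted_attrs, p_gold_attrs):
    """Multiply two class distributions elementwise and renormalize."""
    product = [a * b for a, b in zip(p_predicted_attrs, p_gold_attrs)]
    total = sum(product)
    return [p / total for p in product]

# Class probabilities from the same model run with two attribute tables.
p_predicted = [0.6, 0.3, 0.1]
p_gold = [0.2, 0.7, 0.1]
combined = combine_predictions(p_predicted, p_gold)
best_class = max(range(len(combined)), key=lambda i: combined[i])
```

Here the first table alone would pick class 0, but the product favors class 1, which both tables rate as plausible.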
|most frequent class||61.33||75.45||76.84||76.58||43.67||35.13||42.41||84.97||69.73|
|GloVe + framenet||67.42||80.79||86.27||76.58||49.68||50.32||44.94||88.19||75.95|
|GloVe + ppdb||67.52||80.75||85.89||76.58||51.27||50.95||43.99||88.21||75.74|
|GloVe + wordnet||68.04||81.13||86.58||76.90||54.11||50.95||43.04||88.34||76.37|
|D+E: NBoW + GloVe||67.52||80.76||86.84||75.63||53.48||51.90||41.77||88.03||75.00|
|D+E: BoW + GloVe||63.15||77.89||84.11||77.22||49.68||34.81||38.61||86.18||71.41|
|D+E: BGRU + GloVe||68.43||81.18||86.52||76.58||56.65||53.48||41.14||88.24||76.37|
To evaluate our hypotheses on action attributes and zero-shot learning, we constructed a dataset using crowd-sourcing experiments. The Actions and Attributes dataset consists of annotations for 1710 verb templates, each consisting of a verb and an optional particle (e.g. “put” or “put up”).
We selected all verbs from the imSitu corpus, which consists of images representing verbs from many categories (Yatskar et al., 2016), then extended the set using the MPII movie visual description and ScriptBase datasets (Rohrbach et al., 2015; Gorinski and Lapata, 2015). We used the spaCy dependency parser (Honnibal and Johnson, 2015) to extract the verb template for each sentence, and collected annotations on Mechanical Turk to filter out nonliteral and abstract verbs. Turkers annotated this filtered set of templates using the attributes described in Section 2. In total, 1203 distinct verbs are included. The templates are split randomly by verb; out of 1710 total templates, we save 1313 for training, 81 for validation, and 316 for testing.
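Splitting by verb means that all templates sharing a head verb (for example “put” and “put up”) land in the same partition. The sketch below shows one way to do this; the function name, the seed, and the way split sizes are specified are assumptions of this illustration, not details from the released code.

```python
import random

def split_by_verb(templates, n_val_verbs, n_test_verbs, seed=0):
    """Split templates into train/val/test without sharing verbs."""
    by_verb = {}
    for template in templates:
        head = template.split()[0]  # "put up" -> "put"
        by_verb.setdefault(head, []).append(template)

    verbs = sorted(by_verb)
    random.Random(seed).shuffle(verbs)
    test_verbs = verbs[:n_test_verbs]
    val_verbs = verbs[n_test_verbs:n_test_verbs + n_val_verbs]
    train_verbs = verbs[n_test_verbs + n_val_verbs:]

    def collect(selected):
        return [t for verb in selected for t in by_verb[verb]]

    return collect(train_verbs), collect(val_verbs), collect(test_verbs)

templates = ["put", "put up", "run", "run away", "eat", "drink"]
train, val, test = split_by_verb(templates, n_val_verbs=1, n_test_verbs=1)

def heads(split):
    return {t.split()[0] for t in split}
```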
To provide signal for classifying these verbs, we collected dictionary definitions for each verb using the Wordnik API (available at http://developer.wordnik.com/), with access to the American Heritage Dictionary, the Century Dictionary, the GNU Collaborative International Dictionary, WordNet, and Wiktionary, including only senses that are explicitly labeled “verb.” This leaves us with 23,636 definitions, an average of 13.8 per verb.
We pretrain the BGRU model on the Dictionary Challenge, a collection of 800,000 word-definition pairs obtained from Wordnik and Wikipedia articles (Hill et al., 2016); the objective is to obtain a word’s embedding given one of its definitions. For the BGRU model, we use an internal dimension of 300, and embed the words to a size 300 representation. The vocabulary size is set to 30,000 (including all verbs for which we have definitions). During pretraining, we keep the architecture the same, except a different 300-dimensional final layer is used to predict the GloVe embeddings.
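Following Hill et al. (2016), we use a ranking loss. Let $\hat{w}$ be the predicted word embedding for each definition of a word in the dictionary (not necessarily a verb). Let $w$ be the word’s embedding, and $\tilde{w}$ be the embedding of a random dictionary word. The loss is then given by:
$$L = \max\left\{0,\ \gamma - \cos(w, \hat{w}) + \cos(\tilde{w}, \hat{w})\right\}$$
where $\gamma$ is a margin and $\cos$ denotes cosine similarity.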
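A margin-based ranking loss over cosine similarities, in the style of Hill et al. (2016), can be sketched as follows. The margin default of 0.1 and all names are assumptions of this sketch.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(predicted, target, random_word, margin=0.1):
    """Penalize predictions that are not closer (by at least `margin`
    in cosine similarity) to the target word than to a random word."""
    return max(0.0, margin - cosine(predicted, target) + cosine(predicted, random_word))

good = ranking_loss(predicted=[1.0, 0.0], target=[1.0, 0.0], random_word=[0.0, 1.0])
bad = ranking_loss(predicted=[1.0, 0.0], target=[0.0, 1.0], random_word=[1.0, 0.0])
```

The loss vanishes when the predicted embedding already matches the defined word much better than the random one, and grows when the prediction is closer to the random word.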
After pretraining the model on the Dictionary Challenge, we fine-tune the attribute weights using the cross-entropy over Equation 1.
We build our image-to-verb model on the imSitu dataset, which contains a diverse collection of images depicting one of 504 verbs. The images represent a variety of different semantic role labels (Yatskar et al., 2016). Figure 4 shows examples from the dataset. We apply our attribute split to the dataset and are left with 379 training classes, 29 validation classes, and 96 test classes.
We compare against several additional baseline models for learning from attributes and embeddings. Romera-Paredes and Torr (2015) propose “Embarrassingly Simple Zero-shot Learning” (ESZL), a linear model that directly predicts class labels through attributes and incorporates several types of regularization. We compare against a variant of Lampert et al. (2014)’s DAP model discussed in Section 4.1.1. We additionally compare against DeVISE (Frome et al., 2013), as mentioned in Section 4.2. We use a ResNet-152 CNN finetuned on the imSitu classes as the visual features for these baselines (the same as discussed in Section 4.1).
Additional details are provided in the Appendix.
Our results for action attribute prediction from text are given in Table 1. Several examples are given in the supplemental section in Table 3. Our results on the text-to-attributes challenge confirm that it is a challenging task for three reasons. First, there is noise associated with the attributes: many verb attributes are hard to annotate given that verb meanings can change in context, which is reflected in the inter-annotator agreement of our attributes as measured by Krippendorff’s alpha. Second, there is a lack of training data inherent to the problem: there are not many common verbs in English, and it can be difficult to crowdsource annotations for rare ones. Third, any system must compete with strong frequency-based baselines, as attributes are generally sparse. Moreover, we suspect that were more attributes collected (so as to cover more obscure patterns), the sparsity would only increase.
Despite this, we report strong baseline results on this problem, particularly with our embedding based models. The performance gap between embedding-only and definition-only models can possibly be explained by the fact that the word embeddings are trained on a very large corpus of real-world examples of the verb, while the definition is only a single high-level representation meant to be understood by someone who already speaks that language. For instance, it is likely difficult for the definition-only model to infer whether a verb is transitive or not (Transi.), since definitions might assume commonsense knowledge about the underlying concepts the verb represents. The strong performance of embedding models is further enhanced by using retrofitted word embeddings, suggesting an avenue for improvement on language grounding through better representation of linguistic corpora.
We additionally see that both joint dictionary-embedding models outperform the dictionary-only models overall. In particular, the BGRU+GloVe model performs especially well at determining the aspect and motion attributes of verbs, particularly relative to the baseline. The strong performance of the BGRU+GloVe model indicates that there is some signal that is missing from the distributional embeddings that can be recovered from the dictionary definition. We thus use the predictions of this model for zero-shot image recognition.
Based on error analysis, we found that one common mode of failure is where commonsense knowledge is required. To give an example, the embedding-based model labels “shop” as a likely solitary action. This is possibly because there is a lack of similar verbs in the training set; by random chance, “buy” is also in the test set. We see that this can be partially mitigated by the dictionary, as evidenced by the fact that the dictionary-based models label “shop” as in between social and solitary. Still, it is a difficult task to infer that people like to “visit stores in search of merchandise” together.
Our results for verb prediction from images are given in Table 2. Despite the difficulty of predicting the correct label over 96 unseen choices, our models show predictive power. Although our attribute models do not outperform our embedding models and DeVISE alone, we note that our joint attribute and embedding model scores the best overall, reaching 18.10% in top-1 and 41.46% in top-5 accuracy when using gold attribute annotations for the zero-shot verbs. This result is possibly surprising given the small number of attributes in total, of which most tend to be sparse (as can be seen from the baseline performance in Table 1). We thus hypothesize that collecting more activity attributes would further improve performance.
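For reference, top-1 and top-5 accuracy count an image as correct when its gold label is among the k highest-scoring classes. A minimal sketch, with a hypothetical function name and toy scores over three classes:

```python
def top_k_accuracy(score_rows, gold_labels, k):
    """Fraction of examples whose gold label is among the k top-scored classes."""
    hits = 0
    for scores, gold in zip(score_rows, gold_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_labels)

scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
gold = [1, 1, 0]
top1 = top_k_accuracy(scores, gold, k=1)
top2 = top_k_accuracy(scores, gold, k=2)
```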
We also note the success in performing zero-shot learning with predicted attributes. Perhaps paradoxically, our attribute-only models (along with DAP) perform better in both accuracy metrics when given predicted attributes at test time, as opposed to gold attributes. Further, we get an extra boost by ensembling predictions of our model when given two sets of attributes at test time, giving us the best results overall at 18.15% top-1 accuracy and 42.17% top-5. Interestingly, better performance with predicted attributes is also reported by Al-Halah et al. (2016): predicting the attributes with their CAAP model and then running the DAP model on these predicted attributes outperforms the use of gold attributes at test time. It is somewhat unclear why this is the case; possibly, there is some bias in the attribute labeling, which the attribute predictor can correct for.
In addition to quantitative results, we show some zero-shot examples in Figure 4. The examples show the inherent difficulty of zero-shot action recognition. Incorrect predictions are often reasonably related to the situation (“rub” vs “dye”), but picking the correct target verb based on attribute-based inference is still a challenging task.
Although our results appear promising, we argue that our model still fails to represent much of the semantic information about each image class. In particular, our model is prone to hubness: the overprediction of a limited set of labels at test time (those that closely match signatures of examples in the training set). This problem has previously been observed with the use of word embeddings for zero-shot learning (Marco and Georgiana, 2015) and can be seen in our examples (for instance, the over-prediction of “buy”). Unfortunately, we were unable to mitigate this problem in a way that also led to better quantitative results (for instance, by using a ranking loss as in DeVISE (Frome et al., 2013)). We thus leave resolving the hubness problem in zero-shot activity recognition as a question for future work.
Rubinstein et al. (2015) seek to predict McRae et al. (2005)’s feature norms from word embeddings of concrete nouns. Likewise, the CAAP model of Al-Halah et al. (2016) predicts the object attributes of concrete nouns for use in zero-shot learning. In contrast, we predict verb attributes. A related task is that of improving word embeddings using multimodal data and linguistic resources (Faruqui et al., 2015; Silberer et al., 2013; Vendrov et al., 2016). Our work runs orthogonal to this, as we focus on word attributes as a tool for a zero-shot activity recognition pipeline.
Though distinct, our work is related to zero-shot learning of objects in computer vision. There are several datasets (Nilsback and Zisserman, 2008; Welinder et al., 2010) and models developed on this task (Romera-Paredes and Torr, 2015; Lampert et al., 2014; Mukherjee and Hospedales, 2016; Farhadi et al., 2010). In addition, Ba et al. (2015) augment existing datasets with descriptive Wikipedia articles so as to learn novel objects from descriptive text. As illustrated in Section 1, action attributes pose unique challenges compared to object attributes; thus, models developed for zero-shot object recognition are not as effective for zero-shot action recognition, as has been empirically shown in Section 7.
In prior work, zero-shot activity recognition has been studied on video datasets, each containing a selection of concrete physical actions. The MIXED action dataset, itself a combination of three action recognition datasets, has 2910 labeled videos with 21 actions, each described by 34 action attributes (Liu et al., 2011). These action attributes are concrete binary attributes corresponding to low-level physical movements, for instance, “arm only motion,” “leg: up-forward motion.” By using word embeddings instead of attributes, Xu et al. (2017) study video activity recognition on a variety of action datasets, albeit in the transductive setting wherein access to the test labels is provided during training. In comparison with our work on imSitu, these video datasets lack broad coverage of verb-level classes (and for some, sufficient data points per class).
The abstractness of broad-coverage activity labels makes the problem much more difficult to study with attributes. To get around this, Antol et al. (2014) present a synthetic dataset of cartoon characters performing dyadic actions, and use these cartoon illustrations as internal mappings for zero-shot recognition of dyadic actions in real-world images. We investigate an alternative approach by using linguistically informed verb attributes for activity recognition.
Several possibilities remain open for future work. First, more attributes could be collected and evaluated, possibly integrating other linguistic theories about verbs. Second, future work could move beyond using attributes as a pivot between language and vision domains. In particular, since our experiments show that unsupervised word embeddings significantly help performance, it might be desirable to learn data-driven attributes in an end-to-end fashion directly from a large corpus or from dictionary definitions. Third, future research on action attributes should ideally include videos to better capture attributes that require temporal signals.
Overall, however, our work presents a strong early step towards zero-shot activity recognition, a relatively less studied task that poses several unique challenges over zero-shot object recognition. We introduce new action attributes motivated by linguistic theories and demonstrate their empirical use for reasoning about previously unseen activities.
We thank the anonymous reviewers along with Mark Yatskar, Luke Zettlemoyer, Yonatan Bisk, Maxwell Forbes, Roy Schwartz, and Mirella Lapata, for their helpful feedback. We also thank the Mechanical Turk workers and members of the XLab, who helped with the annotation process. This work is supported by the National Science Foundation Graduate Research Fellowship (DGE-1256082), the NSF grant (IIS-1524371), DARPA CwC program through ARO (W911NF-15-1-0543), and gifts by Google and Facebook.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826.
Predicting deep zero-shot convolutional neural networks using textual descriptions.In ICCV.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Retrofitting Word Vectors to Semantic Lexicons. Association for Computational Linguistics.
Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
Our CNN and BGRU models are built in PyTorch (pytorch.org). All of our one-layer neural network models are built in Scikit-learn (Pedregosa et al., 2011) using the provided LogisticRegression class (using one-versus-rest if appropriate). Our neural models use the Adam optimizer (Kingma and Ba, 2014), though we tweak the default hyperparameters somewhat.
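As a rough illustration of the one-versus-rest setup described above, the sketch below fits one binary logistic regression per value of a single 5-valued attribute. The synthetic data, the feature dimensionality, and the use of the `OneVsRestClassifier` wrapper are assumptions made for this example; they are not taken from the released code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for verb features (e.g., word embeddings) and one
# 5-valued attribute; the real inputs and labels are not reproduced here.
X, y = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=10,
    n_classes=5,
    n_clusters_per_class=1,
    random_state=0,
)

# One binary logistic regression per attribute value (one-versus-rest).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

pred = clf.predict(X[:3])              # predicted value index for three verbs
n_binary_models = len(clf.estimators_)  # one fitted model per attribute value
```

Binary attributes need no wrapper at all, since a plain `LogisticRegression` already handles the two-class case.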
Recall that our dictionary definition model is a bidirectional GRU with a hidden size of 300, with a vocabulary size of 30,000. After pretraining on the Dictionary Challenge, we freeze the word embeddings and apply a dropout rate of before the final hidden layer. We found that such an aggressive dropout rate was necessary due to the small size of the training set. During pretraining, we used a learning rate of , a batch size of , and set the Adam parameter to the default . During finetuning, we set and the batch size to . In general, we found that setting too low of an during finetuning caused our zero-shot models to update parameters too aggressively during the first couple of updates, leading to poor results.
For our CNN models, we pretrained the ResNet-152 (initialized with Imagenet weights) on the training classes of the imSitu dataset, using a learning rate of and . During finetuning, we dropped the learning rate to and set . We also froze all parameters except for the final ResNet block, and the linear attribute and embedding weights. We also found L2 regularization quite important in reducing overfitting, and we applied regularization at a weight of to all trainable parameters.
| Verb | Model | Definition | Social dynamics | Aspectual class | Motion dynamics | Temporal duration | Body involvement |
|---|---|---|---|---|---|---|---|
| shop | GT | To visit stores in search of merchandise or bargains | likely social | accomplishment | high | hours | arms, head |
| | BGRU | | solitary or social | activity | medium | minutes | |
| | BGRU+ | | solitary or social | activity | medium | minutes | |
| mash | GT | To convert malt or grain into mash | likely solitary | activity | high | seconds | arms |
| | BGRU | | solitary or social | achievement | medium | seconds | arms |
| photograph | GT | To take a photograph of | solitary or social | achievement | low | seconds | arms, head |
| | embed | | solitary or social | accomplishment | medium | minutes | arms |
| | BGRU | | solitary or social | achievement | medium | seconds | arms |
| | BGRU+ | | solitary or social | unclear | low | seconds | arms |
| spew out | GT | eject or send out in large quantities also metaphorical | solitary or social | achievement | high | seconds | head |
| | BGRU | | solitary or social | achievement | high | seconds | arms |
| tear | GT | To pull apart or into pieces by force rend | likely solitary | achievement | low | seconds | arms |
| | embed | | solitary or social | achievement | medium | seconds | arms |
| | BGRU | | solitary or social | achievement | high | seconds | arms |
| | BGRU+ | | solitary or social | achievement | high | seconds | arms |
| squint | GT | To look with the eyes partly closed as in bright sunlight | likely solitary | achievement | low | seconds | head |
| shake | GT | To cause to move to and fro with jerky movements | solitary or social | activity | medium | seconds | |
| doze | GT | To sleep lightly and intermittently | likely solitary | state | none | minutes | head |
| writhe | GT | To twist as in pain struggle or embarrassment | solitary or social | activity | high | seconds | arms, torso |
The following is a full list of the attributes. In addition to the attributes presented here, we also crowdsourced attributes for the emotion content of each verb (e.g., happiness, sadness, anger, and surprise). However, we found these annotations to be skewed towards “no emotion”, since most verbs do not strongly associate with a specific emotion. Thus, we omit them in our experiments.
Aspectual Classes: one attribute with 5 values:
Unclear without context
Temporal Duration: one attribute with 5 values:
On the order of seconds
On the order of minutes
On the order of hours
On the order of days
Motion Dynamics: one attribute with 5 values:
Unclear without context
Social Dynamics: one attribute with 5 values:
Solitary or social
Transitivity: Three binary attributes:
Intransitive: 1 if the verb can be used intransitively
Transitive (person): 1 if the verb can be used in the form “verb someone”
Transitive (object): 1 if the verb can be used in the form “verb something”
Effects on Arguments: 12 binary attributes
Intransitive 1: 1 if the verb is intransitive and the subject moves somewhere
Intransitive 2: 1 if the verb is intransitive and the external world changes
Intransitive 3: 1 if the verb is intransitive, and the subject’s state changes
Intransitive 4: 1 if the verb is intransitive, and nothing changes
Transitive (obj) 1: 1 if the verb is transitive for objects and the object moves somewhere
Transitive (obj) 2: 1 if the verb is transitive for objects and the external world changes
Transitive (obj) 3: 1 if the verb is transitive for objects and the object’s state changes
Transitive (obj) 4: 1 if the verb is transitive for objects and nothing changes
Transitive (person) 1: 1 if the verb is transitive for people and the object is a person that moves somewhere
Transitive (person) 2: 1 if the verb is transitive for people and the external world changes
Transitive (person) 3: 1 if the verb is transitive for people and if the object is a person whose state changes
Transitive (person) 4: 1 if the verb is transitive for people and nothing changes
Body Involvement: 5 binary attributes
Arms: 1 if arms are used
Head: 1 if head is used
Legs: 1 if legs are used
Torso: 1 if torso is used
Other: 1 if another body part is used
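One way to picture the attribute inventory listed above is as a flat 0/1 vector per verb. The sketch below is an assumption about representation, not the paper's encoding: it one-hot encodes each of the four 5-valued attributes and appends the 3 + 12 + 5 binary attributes, giving 40 entries. Attribute names such as `can_be_intransitive` are hypothetical, and categorical values are given as indices because the order of the values is not fixed by the text.

```python
# Number of values for each categorical attribute, as listed above.
CATEGORICAL_SIZES = {
    "aspectual_class": 5,
    "temporal_duration": 5,
    "motion_dynamics": 5,
    "social_dynamics": 5,
}

# Binary attributes: 3 transitivity + 12 effects on arguments + 5 body parts.
FRAMES = ("intransitive", "transitive_obj", "transitive_person")
BINARY_NAMES = (
    [f"can_be_{frame}" for frame in FRAMES]
    + [f"effect_{frame}_{i}" for frame in FRAMES for i in range(1, 5)]
    + [f"body_{part}" for part in ("arms", "head", "legs", "torso", "other")]
)


def encode(categorical_values, active_binary):
    """Flatten one verb's attributes into a list of 0/1 entries.

    categorical_values maps each categorical attribute to a value index
    (0-4); active_binary is the set of binary attributes equal to 1.
    """
    unknown = set(active_binary) - set(BINARY_NAMES)
    if unknown:
        raise ValueError(f"unknown binary attributes: {sorted(unknown)}")

    vector = []
    for name, size in CATEGORICAL_SIZES.items():
        index = categorical_values[name]
        if not 0 <= index < size:
            raise ValueError(f"{name} index {index} out of range")
        # One-hot block for this categorical attribute.
        vector.extend(1 if i == index else 0 for i in range(size))
    # One flag per binary attribute, in the order of BINARY_NAMES.
    vector.extend(1 if name in active_binary else 0 for name in BINARY_NAMES)
    return vector


# Hypothetical verb; the indices are arbitrary because the value order is assumed.
example = encode(
    {"aspectual_class": 1, "temporal_duration": 0,
     "motion_dynamics": 2, "social_dynamics": 3},
    {"can_be_transitive_obj", "effect_transitive_obj_3", "body_arms"},
)
```

A vector like `example` could then serve as the prediction target for the attribute classifiers, one output per entry.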