1 Introduction
An important challenge in computational linguistics concerns the construction of large-scale computational lexicons for the numerous natural languages where very large samples of language use are now available. Resnik (1993) initiated research into the automatic acquisition of semantic selectional restrictions. Ribas (1994) presented an approach which takes into account the syntactic position of the elements whose semantic relation is to be acquired. However, those and most of the following approaches require a fixed taxonomy of semantic relations as a prerequisite. This is a problem because (i) entailment hierarchies are presently available for few languages, and (ii) we regard it as an open question whether and to what degree existing designs for lexical hierarchies are appropriate for representing lexical meaning. Both of these considerations suggest the relevance of inductive and experimental approaches to the construction of lexicons with semantic information.
This paper presents a method for the automatic induction of semantically annotated subcategorization frames from unannotated corpora. We use a statistical subcategorization-induction system which estimates probability distributions and corpus frequencies for pairs of a head and a subcat frame [Carroll and Rooth 1998]. The statistical parser can also collect frequencies for the nominal fillers of slots in a subcat frame. The induction of labels for slots in a frame is based upon estimation of a probability distribution over tuples consisting of a class label, a selecting head, a grammatical relation, and a filler head. The class label is treated as hidden data in the EM framework for statistical estimation.

Fig. 1. An induced class interpretable as verbs of scalar change together with nouns denoting things that can move along scales. The most probable nouns and verbs in the class distributions are listed with their probabilities.
Nouns: number 0.0379, rate 0.0315, price 0.0313, cost 0.0249, level 0.0164, amount 0.0143, sale 0.0110, value 0.0109, interest 0.0105, demand 0.0103, chance 0.0099, standard 0.0091, share 0.0089, risk 0.0088, profit 0.0082, pressure 0.0077, income 0.0073, performance 0.0071, benefit 0.0071, size 0.0070, population 0.0068, proportion 0.0067, temperature 0.0065, tax 0.0065, fee 0.0058, time 0.0057, power 0.0057, quality 0.0054, supply 0.0051, money 0.0050.
Verbs: increase.as:s 0.0437, increase.aso:o 0.0392, fall.as:s 0.0344, pay.aso:o 0.0337, reduce.aso:o 0.0329, rise.as:s 0.0257, exceed.aso:o 0.0196, exceed.aso:s 0.0177, affect.aso:o 0.0169, grow.as:s 0.0156, include.aso:s 0.0134, reach.aso:s 0.0129, decline.as:s 0.0120, lose.aso:o 0.0102, act.aso:s 0.0099, improve.aso:o 0.0099, include.aso:o 0.0088, cut.aso:o 0.0088, show.aso:o 0.0080, vary.as:s 0.0078.
2 EM-Based Clustering
In our clustering approach, classes are derived directly from distributional data: a sample of pairs of verbs and nouns, gathered by parsing an unannotated corpus and extracting the fillers of grammatical relations. Semantic classes corresponding to such pairs are viewed as hidden variables or unobserved data in the context of maximum likelihood estimation from incomplete data via the EM algorithm. This approach allows us to work in a mathematically well-defined framework of statistical inference, i.e., standard monotonicity and convergence results for the EM algorithm extend to our method. The two main tasks of EM-based clustering are (i) the induction of a smooth probability model on the data, and (ii) the automatic discovery of class structure in the data. Both of these aspects are respected in our application to lexicon induction. The basic ideas of our EM-based clustering approach were presented in Rooth (1995). Our approach contrasts with the merely heuristic and empirical justification of similarity-based approaches to clustering [Dagan et al. 1998], for which so far no clear probabilistic interpretation has been given. The probability model we use can be found earlier in Pereira et al. (1993). However, in contrast to this approach, our statistical inference method for clustering is formalized clearly as an EM algorithm. Approaches to probabilistic clustering similar to ours were presented recently in Saul and Pereira (1997) and Hofmann and Puzicha (1998). There, EM algorithms for similar probability models have been derived, but applied only to simpler tasks not involving a combination of EM-based clustering models as in our lexicon induction experiment. For further applications of our clustering model see Rooth (1998).

Fig. 2. Class 5: communicative action. The most probable nouns and verbs in the class distributions are listed with their probabilities. (In the original figure, verb-noun pairs seen in the training data are marked with a dot in the class matrix.)
Nouns: man 0.0148, ruth 0.0084, corbett 0.0082, doctor 0.0078, woman 0.0074, athelstan 0.0071, cranston 0.0054, benjamin 0.0049, stephen 0.0048, adam 0.0047, girl 0.0046, laura 0.0041, maggie 0.0040, voice 0.0040, john 0.0039, harry 0.0039, emily 0.0039, one 0.0039, people 0.0038, boy 0.0038, rachel 0.0038, ashley 0.0037, jane 0.0035, caroline 0.0035, jack 0.0035, burun 0.0034, juliet 0.0033, blanche 0.0033, helen 0.0033, edward 0.0033.
Verbs: ask.as:s 0.0542, nod.as:s 0.0340, think.as:s 0.0299, shake.aso:s 0.0287, smile.as:s 0.0264, laugh.as:s 0.0213, reply.as:s 0.0207, shrug.as:s 0.0167, wonder.as:s 0.0148, feel.aso:s 0.0141, take.aso:s 0.0133, sigh.as:s 0.0121, watch.aso:s 0.0110, ask.aso:s 0.0106, tell.aso:s 0.0104, look.as:s 0.0094, give.aso:s 0.0092, hear.aso:s 0.0089, grin.as:s 0.0083, answer.as:s 0.0083.
We seek to derive a joint distribution of verb-noun pairs from a large sample of pairs of verbs $v \in V$ and nouns $n \in N$. The key idea is to view $v$ and $n$ as conditioned on a hidden class $c \in C$, where the classes are given no prior interpretation. The semantically smoothed probability of a pair $(v, n)$ is defined to be:

$$p(v, n) = \sum_{c \in C} p(c, v, n) = \sum_{c \in C} p(c)\, p(v \mid c)\, p(n \mid c).$$

The joint distribution is defined by $p(c, v, n) = p(c)\, p(v \mid c)\, p(n \mid c)$. Note that by construction, conditioning of $v$ and $n$ on each other is made solely through the classes $c$.
In the framework of the EM algorithm [Dempster et al. 1977], we can formalize clustering as an estimation problem for a latent class (LC) model as follows. We are given: (i) a sample space $\mathcal{Y}$ of observed, incomplete data, corresponding to pairs from $V \times N$, (ii) a sample space $\mathcal{X}$ of unobserved, complete data, corresponding to triples from $C \times V \times N$, (iii) a set $\mathcal{X}(y) = \{x \in \mathcal{X} \mid x = (c, y),\ c \in C\}$ of complete data related to the observation $y$, (iv) a complete-data specification $p_\theta(x)$, corresponding to the joint probability $p(c, v, n)$ over $C \times V \times N$, with parameter vector $\theta = \langle \theta_c, \theta_{vc}, \theta_{nc} \mid c \in C,\ v \in V,\ n \in N \rangle$, and (v) an incomplete-data specification $p_\theta(y)$, which is related to the complete-data specification as the marginal probability

$$p_\theta(y) = \sum_{x \in \mathcal{X}(y)} p_\theta(x).$$

The EM algorithm is directed at finding a value $\hat{\theta}$ of $\theta$ that maximizes the incomplete-data log-likelihood function $L$ as a function of $\theta$ for a given sample $\mathcal{Y}$, i.e.,

$$\hat{\theta} = \arg\max_\theta L(\theta), \quad \text{where } L(\theta) = \sum_{y \in \mathcal{Y}} f(y) \ln p_\theta(y) \text{ and } f(y) \text{ is the sample frequency of } y.$$
As prescribed by the EM algorithm, the parameters of $p_\theta(y)$ are estimated indirectly by proceeding iteratively in terms of complete-data estimation for the auxiliary function $Q(\theta; \theta^{(t)})$, which is the conditional expectation of the complete-data log-likelihood given the observed data $y$ and the current fit $\theta^{(t)}$ of the parameter values (E-step). This auxiliary function is iteratively maximized as a function of $\theta$ (M-step), where each iteration is defined by the map

$$\theta^{(t+1)} = \arg\max_\theta Q(\theta; \theta^{(t)}).$$

Note that our application is an instance of the EM algorithm for context-free models [Baum et al. 1970], from which the following particularly simple re-estimation formulae can be derived. Let $x = (c, y)$ for shorthand, and let $f(y)$ be the sample frequency of $y$. Then

$$p(c) = \frac{\sum_{y \in \mathcal{Y}} f(y)\, p(c \mid y)}{\sum_{y \in \mathcal{Y}} f(y)}, \qquad
p(v \mid c) = \frac{\sum_{n \in N} f(v, n)\, p(c \mid v, n)}{\sum_{y \in \mathcal{Y}} f(y)\, p(c \mid y)}, \qquad
p(n \mid c) = \frac{\sum_{v \in V} f(v, n)\, p(c \mid v, n)}{\sum_{y \in \mathcal{Y}} f(y)\, p(c \mid y)},$$

where the conditional probabilities $p(c \mid y)$ are computed from the current parameter values $\theta^{(t)}$.
Intuitively, the conditional expectation of the number of times a particular $c$, $v$, or $n$ choice is made during the derivation is prorated by the conditionally expected total number of times a choice of the same kind is made. As shown by Baum et al. (1970), these expectations can be calculated efficiently using dynamic programming techniques. Every such maximization step increases the log-likelihood function $L$, and a sequence of re-estimates eventually converges to a (local) maximum of $L$.
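As an illustration (not the implementation used in our experiments), the re-estimation formulae above can be realized in a few lines of NumPy. The data representation, a dict mapping (verb, noun) pairs to sample frequencies f(y), is our own assumption:

import numpy as np

def em_cluster(pair_freqs, num_classes, iters=50, seed=0):
    # Latent-class EM for verb-noun pairs: p(v,n) = sum_c p(c) p(v|c) p(n|c).
    # pair_freqs: dict mapping (verb, noun) -> sample frequency f(y).
    rng = np.random.default_rng(seed)
    verbs = sorted({v for v, _ in pair_freqs})
    nouns = sorted({n for _, n in pair_freqs})
    v_idx = {v: i for i, v in enumerate(verbs)}
    n_idx = {n: i for i, n in enumerate(nouns)}
    vs = np.array([v_idx[v] for v, _ in pair_freqs])  # verb index per observed pair
    ns = np.array([n_idx[n] for _, n in pair_freqs])  # noun index per observed pair
    f = np.array(list(pair_freqs.values()), dtype=float)
    # Random starting values; Sect. 3 reports the effect of varying these.
    p_c = rng.dirichlet(np.ones(num_classes))
    p_v_c = rng.dirichlet(np.ones(len(verbs)), size=num_classes)  # p(v|c), shape (C, |V|)
    p_n_c = rng.dirichlet(np.ones(len(nouns)), size=num_classes)  # p(n|c), shape (C, |N|)
    for _ in range(iters):
        # E-step: posterior p(c|v,n) for every observed pair, shape (|Y|, C).
        post = p_c * p_v_c[:, vs].T * p_n_c[:, ns].T
        post /= post.sum(axis=1, keepdims=True)
        w = f[:, None] * post                  # expected class counts f(y) p(c|y)
        den = w.sum(axis=0)                    # sum_y f(y) p(c|y), per class
        # M-step: prorate expected counts by the expected totals.
        p_c = den / f.sum()
        den = np.maximum(den, 1e-12)           # guard against vanished classes
        p_v_c = np.zeros((num_classes, len(verbs)))
        p_n_c = np.zeros((num_classes, len(nouns)))
        np.add.at(p_v_c.T, vs, w)              # sum_n f(v,n) p(c|v,n)
        np.add.at(p_n_c.T, ns, w)              # sum_v f(v,n) p(c|v,n)
        p_v_c /= den[:, None]
        p_n_c /= den[:, None]
    return p_c, p_v_c, p_n_c, v_idx, n_idx

A call such as em_cluster(pairs, num_classes=35, iters=50) mirrors the model size and iteration count used in the experiments reported below.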
In the following, we present some examples of induced clusters. Input to the clustering algorithm was a training corpus of 1280715 tokens (608850 types) of verb-noun pairs participating in the grammatical relations of intransitive and transitive verbs and their subject and object fillers. The data were gathered from the maximal-probability parses which the head-lexicalized probabilistic context-free grammar of [Carroll and Rooth 1998] gave for the British National Corpus (117 million words).
Fig. 2 shows an induced semantic class out of a model with 35 classes; the class index is 5. The figure lists the 30 most probable nouns in the class distribution with their probabilities, together with the 20 most probable verbs. Verbs with suffix .as:s indicate the subject slot of an active intransitive. Similarly, the suffix .aso:s denotes the subject slot of an active transitive, and .aso:o denotes the object slot of an active transitive. Thus the $v$ in the above discussion actually consists of a combination of a verb with a subcat frame slot as:s, aso:s, or aso:o. Induced classes often have a basis in lexical semantics; class 5 can be interpreted as clustering agents, denoted by proper names, "man", and "woman", together with verbs denoting communicative action. Fig. 1 shows a cluster involving verbs of scalar change and things which can move along scales. Fig. 5 can be interpreted as involving different dispositions and modes of their execution.
3 Evaluation of Clustering Models
3.1 Pseudo-Disambiguation
We evaluated our clustering models on a pseudo-disambiguation task similar to that performed in Pereira et al. (1993), but differing in detail. The task is to judge which of two verbs $v$ and $v'$ is more likely to take a given noun $n$ as its argument, where the pair $(v, n)$ has been cut out of the original corpus and the pair $(v', n)$ is constructed by pairing $n$ with a randomly chosen verb $v'$ such that the combination $(v', n)$ is completely unseen. Thus this test evaluates how well the models generalize over unseen verbs.
The data for this test were built as follows. We constructed an evaluation corpus of $(v, n, v')$ triples by randomly cutting a test corpus of 3000 $(v, n)$ pairs out of the original corpus of 1280712 tokens, leaving a training corpus of 1178698 tokens. Each noun $n$ in the test corpus was combined with a verb $v'$ which was randomly chosen according to its frequency such that the pair $(v', n)$ appeared neither in the training nor in the test corpus. However, the elements $v$, $n$, and $v'$ were all required to be part of the training corpus. Furthermore, we restricted the verbs and nouns in the evaluation corpus to those which occurred at least 30 times and at most 3000 times with some verb functor in the training corpus. The resulting 1337 evaluation triples were used to evaluate a sequence of clustering models trained from the training corpus.
The clustering models we evaluated were parameterized in the starting values of the training algorithm, in the number of classes of the model, and in the number of iteration steps, resulting in a sequence of models. Starting from a lower bound of 50% random choice, accuracy was calculated as the proportion of all choices in which the model decided $p(v, n) > p(v', n)$, i.e., preferred the original verb $v$. Fig. 3 shows the evaluation results for models trained with 50 iterations, averaged over starting values, and plotted against class cardinality. Different starting values had an effect of about 2% on the performance of the test. We obtained a value of about 80% accuracy for models between 25 and 100 classes. Models with more than 100 classes show a small but stable overfitting effect.
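As an illustration, the accuracy computation can be sketched as follows; the representation of the evaluation triples and the strict-inequality treatment of ties are our own assumptions:

def pair_prob(p_c, p_v_c, p_n_c, v_idx, n_idx, v, n):
    # Smoothed p(v,n) = sum_c p(c) p(v|c) p(n|c); 0.0 for unknown words.
    if v not in v_idx or n not in n_idx:
        return 0.0
    return float((p_c * p_v_c[:, v_idx[v]] * p_n_c[:, n_idx[n]]).sum())

def pseudo_disambiguation_accuracy(triples, prob):
    # triples: list of (v, n, v2) where (v, n) was cut from the corpus and
    # (v2, n) is completely unseen; prob gives the smoothed pair probability.
    # A choice counts as correct when the model prefers the original verb v.
    correct = sum(1 for v, n, v2 in triples if prob(v, n) > prob(v2, n))
    return correct / len(triples)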
3.2 Smoothing Power
A second experiment addressed the smoothing power of the model by counting the number of pairs in the set $V \times N$ of all possible combinations of verbs and nouns which received a positive joint probability from the model. The space $V \times N$ for the above clustering models included about 425 million combinations; we approximated the smoothing size of a model by randomly sampling 1000 pairs from $V \times N$ and returning the percentage of positively assigned pairs in the random sample. Fig. 4 plots the smoothing results for the above models against the number of classes. Starting values had an influence of about 1% on performance. Given the proportion of the number of types in the training corpus to the size of the space $V \times N$, without clustering we have a smoothing power of 0.14%, whereas for example a model with 50 classes and 50 iterations has a smoothing power of about 93%.
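A sketch of this sampling approximation, assuming a pair_prob-style callable as above and lists of the verbs and nouns in the model's vocabulary:

import random

def smoothing_power(verbs, nouns, prob, sample_size=1000, seed=0):
    # Approximate the percentage of the V x N space receiving positive
    # probability by sampling random verb-noun combinations.
    rng = random.Random(seed)
    hits = sum(1 for _ in range(sample_size)
               if prob(rng.choice(verbs), rng.choice(nouns)) > 0.0)
    return 100.0 * hits / sample_size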
Corresponding to the maximum likelihood paradigm, an increasing number of training iterations decreased the smoothing performance, whereas the accuracy of the pseudo-disambiguation increased with the number of iterations. We found 50 iterations to be a good compromise in this trade-off.
4 Lexicon Induction Based on Latent Classes
The goal of the following experiment was to derive a lexicon of several hundred intransitive and transitive verbs with subcat slots labeled with latent classes.
4.1 Probabilistic Labeling with Latent Classes Using EM Estimation
To induce latent classes for the subject slot of a fixed intransitive verb, the following statistical inference step was performed. Given a latent class model $p_{LC}(\cdot)$ for verb-noun pairs, and a sample $n_1, \dots, n_M$ of subjects for a fixed intransitive verb, we calculate the probability of an arbitrary subject $n \in N$ by:

$$p(n) = \sum_{c \in C} p(c)\, p_{LC}(n \mid c).$$

The estimation of the parameter vector $\theta = \langle \theta_c \mid c \in C \rangle$ can be formalized in the EM framework by viewing $p(n)$ (or the likelihood of the sample) as a function of $\theta$ for fixed $p_{LC}(\cdot)$. The re-estimation formula resulting from the incomplete-data estimation for these probability functions has the following form ($f(n)$ is the frequency of $n$ in the sample of subjects of the fixed verb):

$$p(c) = \frac{\sum_{n \in N} f(n)\, p(c \mid n)}{\sum_{n \in N} f(n)}.$$
A similar EM induction process can be applied also to pairs of nouns, thus enabling the induction of latent semantic annotations for transitive verb frames. Given a LC model $p_{LC}(\cdot)$ for verb-noun pairs, and a sample $(n^1_1, n^2_1), \dots, (n^1_M, n^2_M)$ of noun argument pairs (subjects $n^1$ and direct objects $n^2$) for a fixed transitive verb, we calculate the probability of its noun argument pairs by:

$$p(n^1, n^2) = \sum_{c^1, c^2 \in C} p(c^1, c^2)\, p_{LC}(n^1 \mid c^1)\, p_{LC}(n^2 \mid c^2).$$

Again, the estimation of the parameter vector $\theta = \langle \theta_{c^1 c^2} \mid c^1, c^2 \in C \rangle$ can be formalized in an EM framework by viewing $p(n^1, n^2)$ (or the likelihood of the sample) as a function of $\theta$ for fixed $p_{LC}(\cdot)$. The re-estimation formula resulting from this incomplete-data estimation problem has the following simple form ($f(n^1, n^2)$ is the frequency of $(n^1, n^2)$ in the sample of noun argument pairs of the fixed verb):

$$p(c^1, c^2) = \frac{\sum_{n^1, n^2 \in N} f(n^1, n^2)\, p(c^1, c^2 \mid n^1, n^2)}{\sum_{n^1, n^2 \in N} f(n^1, n^2)}.$$
Note that the class distributions $p(c)$ and $p(c^1, c^2)$ for intransitive and transitive models can be computed also for verbs unseen in the LC model.
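Both labeling steps are small EM loops in which only the class distribution is re-estimated while the LC model parameters stay fixed. A minimal sketch for the intransitive case, reusing the p(n|c) matrix and noun index from the clustering sketch above (the transitive case is analogous, with a joint distribution over class pairs):

import numpy as np

def label_subject_slot(subject_freqs, p_n_c, n_idx, iters=50, seed=0):
    # subject_freqs: dict noun -> frequency f(n) in the sample of subjects
    # of the fixed intransitive verb; p_n_c: fixed LC parameters p(n|c).
    rng = np.random.default_rng(seed)
    nouns = [n for n in subject_freqs if n in n_idx]
    f = np.array([subject_freqs[n] for n in nouns], dtype=float)
    lik = p_n_c[:, [n_idx[n] for n in nouns]].T   # p_LC(n|c), shape (|sample|, C)
    p_c = rng.dirichlet(np.ones(p_n_c.shape[0]))  # class distribution to induce
    for _ in range(iters):
        post = p_c * lik                              # E-step: p(c) p_LC(n|c)
        post /= post.sum(axis=1, keepdims=True)       # normalize to p(c|n)
        p_c = (f[:, None] * post).sum(axis=0) / f.sum()  # M-step re-estimation
    return p_c

p_c.argmax() then gives the most probable class label for the slot, and the quantities $f(n)\, p(c \mid n)$ from the final E-step yield estimated frequencies of the kind reported in Figs. 6 and 7.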

Fig. 5. An induced class, interpreted in Sect. 2 as involving different dispositions and modes of their execution. The most probable nouns and verbs in the class distributions are listed with their probabilities.
Nouns: change 0.0385, use 0.0162, increase 0.0157, development 0.0101, growth 0.0073, effect 0.0071, result 0.0063, degree 0.0060, response 0.0060, approach 0.0057, reduction 0.0055, form 0.0052, condition 0.0052, understanding 0.0051, improvement 0.0050, treatment 0.0050, skill 0.0048, action 0.0047, process 0.0047, activity 0.0046, knowledge 0.0041, factor 0.0041, level 0.0040, type 0.0040, reaction 0.0039, kind 0.0038, difference 0.0037, movement 0.0037, loss 0.0036, amount 0.0036.
Verbs: require.aso:o 0.0539, show.aso:o 0.0469, need.aso:o 0.0439, involve.aso:o 0.0383, produce.aso:o 0.0270, occur.as:s 0.0255, cause.aso:s 0.0192, cause.aso:o 0.0189, affect.aso:s 0.0179, require.aso:s 0.0162, mean.aso:o 0.0150, suggest.aso:o 0.0140, produce.aso:s 0.0138, demand.aso:o 0.0109, reduce.aso:s 0.0109, reflect.aso:o 0.0097, involve.aso:s 0.0092, undergo.aso:o 0.0091.
4.2 Lexicon Induction Experiment
Experiments used a model with 35 classes. From maximal-probability parses for the British National Corpus derived with the statistical parser of [Carroll and Rooth 1998], we extracted frequency tables for intransitive verb/subject pairs and transitive verb/subject/object triples. The 500 most frequent verbs were selected for slot labeling. Fig. 6 shows two verbs for which the most probable class label is 5, a class which we earlier described as communicative action, together with the estimated frequencies $f(n)\, p(c \mid n)$ for the ten nouns $n$ for which this estimated frequency is highest.
Fig. 6.
blush (class 5, 0.982975)      snarl (class 5, 0.962094)
constance    3                 mandeville   2
christina    3                 jinkwa       2
willie       2.99737           man          1.99859
ronni        2                 scott        1.99761
claudia      2                 omalley      1.99755
gabriel      2                 shamlou      1
maggie       2                 angalo       1
bathsheba    2                 corbett      1
sarah        2                 southgate    1
girl         1.9977            ace          1
Fig. 7 shows corresponding data for an intransitive scalar motion sense of increase.
Fig. 7.
increase (class 17, 0.923698)
number       134.147     proportion   23.8699
demand       30.7322     size         22.8108
pressure     30.5844     rate         20.9593
temperature  25.9691     level        20.7651
cost         23.9431     price        17.9996
Fig. 8 shows the intransitive verbs which take 17 as the most probable label. Intuitively, the verbs are semantically coherent. When compared to Levin's (1993) 48 top-level verb classes, we found an agreement of our classification with her class of "verbs of changes of state", except for the last three verbs in the list in Fig. 8, which is sorted by probability of the class label.
Similar results for German intransitive scalar motion verbs are shown in Fig. 9. The data for these experiments were extracted from the maximal-probability parses of a 4.1 million word corpus of German subordinate clauses, yielding 418290 tokens (318086 types) of pairs of verbs or adjectives and nouns. The lexicalized probabilistic grammar for German used here is described in Beil et al. (1999). We compared the German examples of scalar motion verbs to the linguistic classification of verbs given by Schuhmacher (1986) and found an agreement of our classification with the class of "einfache Änderungsverben" (simple verbs of change), except for the verbs anwachsen (increase) and stagnieren (stagnate), which were not classified there at all.
Fig. 8.
0.977992  decrease      0.560727  drop
0.948099  double        0.476524  grow
0.923698  increase      0.42842   vary
0.908378  decline       0.365586  improve
0.877338  rise          0.365374  climb
0.876083  soar          0.292716  flow
0.803479  fall          0.280183  cut
0.672409  slow          0.238182  mount
0.583314  diminish
Fig. 9.
0.741467  ansteigen    (go up)
0.720221  steigen      (rise)
0.693922  absinken     (sink)
0.656021  sinken       (go down)
0.438486  schrumpfen   (shrink)
0.375039  zurückgehen  (decrease)
0.316081  anwachsen    (increase)
0.215156  stagnieren   (stagnate)
0.160317  wachsen      (grow)
0.154633  hinzukommen  (be added)
Fig. 10 shows the most probable pair of classes for increase as a transitive verb, together with estimated frequencies for the head filler pairs $(n^1, n^2)$. Note that the object label 17 is the class found with intransitive scalar motion verbs; this correspondence is exploited in the next section.
Fig. 10.
increase  0.3097650
subject          object         est. freq.
development      pressure       2.3055
fat              risk           2.11807
communication    awareness      2.04227
supplementation  concentration  1.98918
increase         number         1.80559
5 Linguistic Interpretation
In some linguistic accounts, multi-place verbs are decomposed into representations involving (at least) one predicate or relation per argument. For instance, the transitive causative/inchoative verb increase is composed of an actor/causative verb combining with a one-place predicate in the structure on the left in Fig. 11. Linguistically, such representations are motivated by argument alternations (diathesis), case linking and deep word order, language acquisition, scope ambiguity, by the desire to represent aspects of lexical meaning, and by the fact that in some languages the postulated decomposed representations are overt, with each primitive predicate corresponding to a morpheme. For references and recent discussion of this kind of theory see Hale and Keyser (1993) and Kural (1996).
We will sketch an understanding of the lexical representations induced by latent-class labeling in terms of the linguistic theories mentioned above, aiming at an interpretation which combines computational learnability, linguistic motivation, and denotational-semantic adequacy. The basic idea is that latent classes are computational models of the atomic relation symbols occurring in lexical-semantic representations. As a first implementation, consider replacing the relation symbols in the first tree in Fig. 11 with relation symbols derived from the latent class labeling. In the second tree in Fig. 11, the relation symbols bear indices derived from the labeling procedure of Sect. 4. Such representations can be semantically interpreted in standard ways, for instance by interpreting relation symbols as denoting relations between events and individuals.