1 Introduction
A frequent object of study in linguistic typology is the variation across languages in the order of elements inside the noun phrase (NP). In particular, much work has focused on predicting the relative frequencies across languages of orders of the elements {demonstrative, adjective, numeral, noun}. Table 1 shows the relative frequencies of different orders for these elements across languages (assuming each language exhibits only one dominant order) according to data given in Dryer (in prep). In this table, D stands for demonstrative, N stands for numeral, A stands for adjective, and n stands for noun. Genera counts are the counts of linguistic genera showing a certain order; adjusted frequencies are calculated using a methodology described in Dryer (in prep) intended to minimize the overrepresentation of some orders that may arise from areal effects.
Order  Adjusted frequency  Genera count 

nAND  43.50  84 
DNAn  36.62  57 
DnAN  28.34  38 
DNnA  21.18  31 
NnAD  15.33  28 
nADN  14.78  19 
nDAN  9.00  11 
nNAD  9.00  9 
DnNA  8.77  10 
DAnN  6.11  8 
nDNA  4.67  5 
NAnD  4.00  5 
AnND  3.00  3 
NnDA  3.00  3 
NDAn  3.00  3 
AnDN  2.49  3 
DANn  2.00  2 
nNDA  1.00  1 
NADn  0.00  0 
NDnA  0.00  0 
ADnN  0.00  0 
ADNn  0.00  0 
ANDn  0.00  0 
ANnD  0.00  0 
Here we consider three proposals from the literature on how to explain these frequencies, comparing them statistically in a log-linear framework based on how well they can predict the typological data given in Table 1.
The three proposals are those of Dryer (in prep), Cysouw (2010), and Cinque (2005); the proposal in Dryer (in prep) is an update of the earlier proposal of Dryer (2006). The first two theories are featural in nature: they associate each order with a set of marked features, and claim that orders with more marked features will be less frequent. The third, that of Cinque (2005), is derivational in nature: it gives a generative model of how certain word orders arise, in which certain decisions in the generative process are considered marked. Orders that require more marked operations to be generated are claimed to be less frequent. We reduce this last model to a featural model, and then compare which proposal provides the feature system that best predicts the typological data when the features are allowed different degrees of markedness.
2 Method
2.1 Basics
We consider each proposal from the literature to define a feature system, and compare the ability of each feature system to predict the observed frequencies of orders. To do so, we use Poisson regression, as first applied to this problem by Cysouw (2010). In Poisson regression we represent each language with a set of binary-valued features, and say that the expected frequency of a language in a sample of languages is given by:
(1)  λ = exp(w₀ + Σᵢ wᵢfᵢ)

where fᵢ is an indicator variable with value 1 when the i-th feature is T, and 0 when the i-th feature is F, and where the weights w₀ and wᵢ are those that maximize the probability of the observed counts of languages. The weight w₀ is called a bias term. Under the probabilistic model of Poisson regression, the probability that a language with feature values f₁, …, fₙ has frequency k is:

(2)  P(k) = λᵏe^(−λ)/k!
Fitting a Poisson regression model with a given set of features to a set of (possibly adjusted) frequencies tells us how well it is possible to predict languages in this framework given that set of features. Feature weights may be negative, in which case the corresponding features can be considered marked: the model then embodies the claim that the presence of these features is disfavored in languages. Since features get different weights, the model implements different degrees of markedness per feature, as in Harmonic Grammar (Smolensky and Legendre, 2006). The model finds the degrees of markedness which best predict the data given the feature system.
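The fitting procedure described above can be sketched in a few lines. The feature matrix, counts, and learning settings below are invented toy values (not the actual data or any of the featurizations under comparison); a minimal sketch assuming numpy:

```python
import numpy as np

# Toy illustration only: 4 word orders, 2 invented binary features, and
# invented frequency counts (not the actual data or featurizations).
X = np.array([[0, 0],
              [1, 0],
              [0, 1],
              [1, 1]], dtype=float)
counts = np.array([84.0, 28.0, 10.0, 3.0])

# Prepend a constant column so w[0] plays the role of the bias term w0.
Xb = np.hstack([np.ones((len(X), 1)), X])

# Maximize the Poisson log likelihood by plain gradient ascent.
w = np.zeros(Xb.shape[1])
for _ in range(20000):
    lam = np.exp(Xb @ w)                 # expected frequencies, Eq. (1)
    w += 0.001 * (Xb.T @ (counts - lam))

lam = np.exp(Xb @ w)
print(np.round(w, 2))    # bias + feature weights; negative = marked
print(np.round(lam, 1))  # predicted counts for the four orders
```

At the optimum the predicted counts sum to the observed total (a standard property of Poisson regression with a bias term), and both invented features here receive negative weights, since orders bearing them are rarer.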
2.2 Feature systems under comparison
Within this framework, we compare three feature systems: (1) the system in Dryer (in prep), (2) the system in Cysouw (2010), and (3) the theory of Cinque (2005). Cinque’s theory is not phrased in terms of features, so we use two reductions of his theory to features: those presented in Merlo (2015) and our own, shown in Figure 5. Our featurization of Cinque’s theory closely parallels the featurization given in Cysouw (2010).
2.3 Dependent variables
We apply Poisson regression to predict two quantities. First, we try to predict the adjusted frequency of each order, as given in Dryer (in prep) and shown in Table 1. (We round the adjusted frequencies to the nearest integer in order to satisfy the Poisson regression assumption that the dependent variable is a natural number.) Second, we try to predict the counts of genera given in the same paper.
2.4 Basis for model comparison
We compare models using log likelihood, the log probability assigned to the observed frequencies under the model. A model fits the data well when it assigns high probability to the data, so high log likelihood indicates a good fit. When log likelihood for different models is close, we can also compare them by their degrees of freedom, which is the number of free parameters in the model. In general, simpler models with fewer parameters are preferable over ones with more parameters.
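To make the comparison criterion concrete, here is a minimal sketch of the Poisson log likelihood computation for two hypothetical sets of model predictions over the same observed counts (all numbers invented for illustration):

```python
import math

def poisson_loglik(counts, preds):
    """Log probability of the observed counts under independent Poisson
    variables whose means are the model's predicted frequencies."""
    return sum(c * math.log(m) - m - math.lgamma(c + 1)
               for c, m in zip(counts, preds))

counts = [84, 57, 38, 31]             # invented observed counts
model_a = [80.0, 60.0, 40.0, 30.0]    # hypothetical model A predictions
model_b = [70.0, 70.0, 35.0, 35.0]    # hypothetical model B predictions

ll_a = poisson_loglik(counts, model_a)
ll_b = poisson_loglik(counts, model_b)
print(ll_a, ll_b)  # both negative; the higher (less negative) fits better
```

If the log likelihoods were close, the tie would be broken in favor of the model with fewer free parameters, as described above.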
2.5 Notes on Featurization of Cinque (2005)
Special care is needed when reducing the theory of Cinque (2005) to features so that it can be compared with the other theories in a regression framework.
The theory of Cinque (2005) is not featural in nature, but rather derivational. In this model, orders are built up by a generative process that makes decisions in a certain order; whereas in featural models, orders are assigned scores based on features that have no intrinsic order. For example, the centerpiece of the theory of Cinque (2005) is the claim that the Merge order of DNAn is universal, and that the Linear Correspondence Axiom (LCA) of Kayne (1994) holds, such that every word order must be generated from a base structure of the form seen in Figure 1, plus movement operations. Orders that are not derivable from this base structure are not generated at all in Cinque’s theory. For such orders, the question of what if any movement operations apply never even arises under Cinque’s theory, in principle.
For example, suppose we think of the Cinque model in terms of features: then AlternativeMergeOrder is a feature that can be T or F, and some movement operation is reflected in a feature that can be T or F. For an order that violates the specified Merge order, we would give it the value T for AlternativeMergeOrder and F for the movement feature, but this is not a completely correct reflection of Cinque’s derivational theory. The reason is that under the generative process, if a word order violates the required Merge order, then the model never even decides whether to perform a movement operation: thus the value of the movement feature should not be T or F, but undefined.
The theory of Cinque (2005) also involves theoretically derived, graded markedness values for operations in the generative process: for instance, total movement is claimed to be unmarked, “picture of who”-type movement is claimed to be especially marked, and the remaining movement features are claimed to be marked. In our methodology, we let the model decide on feature weights (markedness values) without regard to these a priori markedness values. As such, it is possible that our implementation of the Cinque model does not reflect its full intent.
The fact that our weights are derived through fits to the data and not through a priori considerations is especially noteworthy in the case of the feature AlternativeMergeOrder in Cinque’s system. In a literal reading of Cinque (2005), orders which violate the required merge order should occur with probability 0, and thus the feature AlternativeMergeOrder should be assigned infinitely negative weight. In that case the model would assign probability 0 to any data that has nonzero frequency for any such order. In our work we let the model learn that AlternativeMergeOrder has a large negative weight, without postulating that orders violating the required merge order must have probability 0.
Here we represent Cinque’s model using features for the sake of convenience in statistical comparison, while noting that this introduces the issues above. And there is a further issue that should be noted. If Merge orders other than that seen in Figure 1 are allowed, then not only are otherwise impossible word orders generable by postulating that those word orders are generated as a base structure; additionally, word orders that were already generable under Cinque’s theory also wind up with new derivations from different base structures. Taking this multiplicity of possible derivations into account increases the complexity of the problem of inferring feature weights, and takes it outside the scope of standard Poisson regression or other generalized linear models. This is a difficulty shared by the modeling approaches of Cysouw (2010) and Merlo (2015). For expediency, however, we follow previous work in treating every word order not derivable from the DNAn base structure of Figure 1 as being derived by an alternative Merge order (so that AlternativeMergeOrder =T) with no movement, ignoring alternative derivations.
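The coding convention we have just described can be summarized in a short sketch. The set of derivable orders and the movement feature names below are illustrative placeholders, not Cinque's actual analysis:

```python
# Sketch of the coding convention described above. The derivable set and
# the movement feature names are illustrative placeholders, not Cinque's
# actual analysis.
DERIVABLE = {"DNAn", "DNnA", "DnAN", "nAND"}      # hypothetical subset

MOVEMENT_FEATURES = ("np_move_no_pp", "pic_of_who_move", "partial_move")

def featurize(order, movement=()):
    """Map a word order to binary features. Orders not derivable from
    the DNAn base get AlternativeMergeOrder=True with every movement
    feature False (an 'undefined' value collapsed to False)."""
    feats = {f: False for f in MOVEMENT_FEATURES}
    if order in DERIVABLE:
        feats["AlternativeMergeOrder"] = False
        for f in movement:
            feats[f] = True
    else:
        feats["AlternativeMergeOrder"] = True
    return feats

print(featurize("DNAn"))   # base order: nothing marked
print(featurize("ADnN"))   # violating order: AlternativeMergeOrder only
```

The resulting binary feature dictionaries are exactly the kind of input the Poisson regression of Section 2.1 consumes.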
We also note that in formalizing Cinque’s (2005) theory into a featural description, we noticed certain unclarities in the text which affected our formalization. These are as follows (note that Cinque uses N where we use n, and Num where we use N):

Cinque describes order AnDN (his (k)) as involving two marked options: “derivation with raising of NP plus pied-piping of the picture of who type of the lowest modifier (A), followed by raising of [A N] without pied-piping around both Num and Dem” (emphasis ours). Technically speaking, it is not clear whether this latter raising (without pied-piping) should count as marked in his system, since the only relevant parameter of movement is “Movement of NP without pied-piping” (his (7biii)), but this latter raising is movement of AP, not NP. Nevertheless we followed Cinque in assigning this word order a marked value for movement (of NP) without pied-piping. This decision is also justified by Cinque’s comment (p. 320) that his system is “[k]eeping to the idea that…postnominal orders are only a function of the raising of the NP (or of an XP containing the NP)…”.
Additionally, there is an inconsistency in Cinque (2005) between (7bv), where this order is stated to involve partial movement, and (6k), where partial movement is not listed as a type of markedness for this order. Here we went with (7bv) and listed this order as involving partial_move=+; we believe that this treatment is the most globally consistent overall, on analogy with orders such as NnAD, which Cinque treats as involving partial movement because there are multiple types of movement and the first (raising of NP around A) is only partial.
As with AnDN, there is an inconsistency between (7bv) and the word-order-specific description of AnND (6w): in the former, this order is stated to involve partial movement of NP, but in the latter, partial movement is not mentioned as a type of markedness. As with AnDN, we listed this order as involving partial_move=+.
Regarding the first two cases, it should be emphasized that Cinque (2005) is far from totally clear about what does and does not count as partial (and thus marked) movement. For example, NnAD is described as involving partial (and thus marked) raising of NP around A, followed by a second raising that gets the raised constituent all the way to the left edge. But although nNAD likewise starts with a partial raising of NP around A (and N) followed by a second raising that gets the raised constituent all the way to the left edge, it is not considered to involve partial movement.
Additionally, there is one case of what we believe is a coding error by Cysouw (2010) in his implementation of Cinque’s model (see his Appendix on page 284):

Cysouw encodes nNAD as involving NP movement with pied-piping of the picture of who variety, but Cinque describes this order ((6t)) as involving whose picture pied-piping instead; Cinque’s description seems correct to us.
2.6 Comparison with Merlo (2015)
Merlo (2015) conducts a study with aims similar to ours, using featurizations more or less the same as those discussed above. She uses features to predict frequency classes with a Naive Bayes estimator and a Weighted Averaged One-Dependence Estimator (WAODE), rather than the Poisson regression used here and proposed by Cysouw (2010). As a summary of how this works: Merlo (2015) first discretizes the integer-valued word order frequency counts (by language or by genus) into 2, 4, or 7 categories; she then learns a model that assigns word orders to frequency classes according to the classic Naive Bayes formula:

P(c | f₁, …, fₙ) ∝ P(c) ∏ᵢ P(fᵢ | c)

where fᵢ is the value of the i-th feature in the featurization scheme under consideration; our fᵢ here correspond to the feature variables in Merlo’s Equations 1–4, p. 334. A word order is assigned to the frequency class c that maximizes the probability above for that word order’s features. Technically these “features” are attribute-value pairs, such as Symmetry1=T for the Dryer model or Partial=whosepp for the Cinque model. The WAODE model is somewhat more complex than the Naive Bayes model but is fundamentally similar.
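A minimal sketch of this classification rule, with invented classes and invented probability tables (not Merlo's fitted values):

```python
import math

# Illustrative Naive Bayes classifier over frequency classes. The
# priors and conditional probabilities below are invented, not fitted.
priors = {"Frequent": 0.3, "Rare": 0.7}
# P(feature = value | class), one table per attribute
likelihoods = {
    "Harmony":   {"Frequent": {"Y": 0.8, "N": 0.2},
                  "Rare":     {"Y": 0.3, "N": 0.7}},
    "Symmetry1": {"Frequent": {"T": 0.6, "F": 0.4},
                  "Rare":     {"T": 0.4, "F": 0.6}},
}

def classify(features):
    """Return the class c maximizing P(c) * prod_i P(f_i | c)."""
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for name, value in features.items():
            score += math.log(likelihoods[name][c][value])
        scores[c] = score
    return max(scores, key=scores.get)

print(classify({"Harmony": "Y", "Symmetry1": "T"}))
print(classify({"Harmony": "N", "Symmetry1": "F"}))
```

Note that the output is a categorical class label, not a predicted count; this is the key contrast with the Poisson regression approach used here.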
These models are trained under two approaches: type-based, where the training data consists of language types with their features and frequency classes, and token-based, where frequent language types are repeated multiple times in the training data and language types with frequency zero do not appear at all. In the token-based approach, the model is therefore not penalized for miscategorizing language types in the frequency class None, because these do not appear in the training data.
Our work differs from Merlo’s approach on two points:

In predicting typological data, we use Poisson regression, which is a discriminative log-linear predictor, rather than Naive Bayes and WAODE, which are generative models.

Our models predict integervalued typological counts, whereas the models in Merlo (2015) predict unordered categoricallyvalued frequency classes.
We favor Poisson regression (and more generally log-linear models) over the Naive Bayes/WAODE approach because it allows us to predict more fine-grained typological data and to model the strong intuition that the effects of features on typological frequencies should be monotonic.
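The monotonicity point follows directly from the model: turning a feature on multiplies the expected frequency by exp(wᵢ), so a single weight can only ever raise, or only ever lower, a prediction. A one-line illustration with invented weights:

```python
import math

# Invented weights: a bias and one negative (marked) feature weight.
w_bias, w_feat = 4.0, -1.5
lam_off = math.exp(w_bias)            # expected frequency, feature off
lam_on = math.exp(w_bias + w_feat)    # expected frequency, feature on
print(lam_on / lam_off)  # ratio is exp(w_feat) regardless of the bias
```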
Merlo’s (2015) models have the property that feature weights need not be monotonic in their preferences for frequency classes. For example, in a model whose goal is to classify each language into the categories (e.g.) {Very Frequent, Frequent, Rare, None}, there is nothing to prevent a feature from getting weights that favor Very Frequent and None while disfavoring Frequent and Rare. An example of this non-monotonicity in feature weights can be seen in Merlo’s (2015) Table 11, reproduced in our Table 2: the feature Harmony=Y favors a language being either Very Frequent or None, while favoring Frequent and Rare less. This non-monotonicity means that the weights from this framework cannot be interpreted as markedness values, which either penalize an order (make it less frequent) or do not. In addition to making the model weights less interpretable, this non-monotonicity gives the model the flexibility to take advantage of artifacts of the discretization of word order frequencies into bins.

Probability  Value 

P(Very Frequent)  = 0.99 
P(Frequent)  = 0.16 
P(Rare)  = 0.51 
P(None)  = 0.55 
2.7 Comparison with Cysouw (2010)
As stated in Section 2.5, our approach here is very similar to that of Cysouw (2010): we use the same statistical model class and the same theory comparison (Cinque/Cysouw/Dryer). The differences are as follows:

We use the more recent data of Dryer (in prep) rather than the earlier data of Dryer (2006);

We use the feature set of Dryer (in prep) rather than the earlier feature set of Dryer (2006);

We correct what we believe is a featurization error made by Cysouw in featurizing Cinque’s theory.
3 Results
The results do not give clear grounds for deciding between the Dryer model and the Cinque model, but both come out better than the Cysouw (2010) model. Whether the Dryer model or the Cinque model fits better depends on whether we use the models to predict adjusted frequencies or genera counts.
3.1 Predicting Adjusted Frequencies
Table 3 shows log likelihoods for models predicting adjusted frequency (rounded to the nearest integer). It also shows the number of parameters (d.f.) in each model. The table shows that Cinque’s model, under our featurization, slightly outperforms Dryer (in prep) in fitting the data.
Model  Log likelihood  d.f. 

Dryer (in prep)  −54.3  6 
Cysouw (2010)  −77.2  5 
Cinque (2005) (our features)  −53.0  8 
Cinque (2005) (Merlo’s features)  −56.5  8 
For a more detailed comparison of model performance, we compared model predictions to observed adjusted frequencies from Dryer (in prep). Figure 2 shows model predictions compared against adjusted frequency.
We wanted to know how much each order contributed to model fit, so in Figure 3 we show signed discrepancies between model predictions and adjusted frequencies. The discrepancy measures how much the prediction error for each word order contributes to the overall mismatch between data and model fit; the signed discrepancy additionally records the direction of the error for each order. If a model predicts a count of λᵢ for the i-th word order and the observed count is cᵢ, then the sign of the discrepancy indicates whether λᵢ is greater or less than cᵢ (overprediction or underprediction).
The magnitude of the discrepancy corresponds to how much a model is penalized for failing to predict a certain order.
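One way to compute such a signed discrepancy, consistent with the description above (our reconstruction: the magnitude is taken as the per-order log likelihood penalty relative to a perfect prediction, and the sign indicates over- versus underprediction; all counts below are invented):

```python
import math

def poisson_logpmf(c, lam):
    return c * math.log(lam) - lam - math.lgamma(c + 1)

def signed_discrepancy(observed, predicted):
    """Per-order log likelihood penalty relative to a perfect prediction,
    signed by the direction of the error: positive when the model
    overpredicts an order, negative when it underpredicts it. This is
    our reconstruction of the measure, not necessarily the exact one."""
    out = []
    for c, lam in zip(observed, predicted):
        best = poisson_logpmf(c, max(c, 1e-9))   # saturated baseline
        penalty = best - poisson_logpmf(c, lam)
        out.append(math.copysign(penalty, lam - c))
    return out

obs = [84, 57, 38, 0]            # invented observed counts
pred = [70.0, 60.0, 38.0, 2.0]   # invented model predictions
print([round(d, 2) for d in signed_discrepancy(obs, pred)])
```

A perfectly predicted order gets discrepancy 0; an underpredicted frequent order gets a large negative value, and an overpredicted rare order a positive one.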
As another way to analyze the results, we show in Figures 4 and 5 the weights assigned to features for different word orders under the Dryer model and the Cinque model. Here we see that Cinque’s model works by strongly penalizing low-frequency orders via the AlternativeMergeOrder feature; the differences among the remaining orders are handled by the rest of the features.
One limitation of applying Poisson regression to the adjusted frequency data is that Dryer (in prep)’s method of computing adjusted frequencies in general compresses high frequency counts more than low frequency counts. This means that adjusted frequency may overly penalize models that perform best at predicting the counts of common orders. For this reason, it is also important to evaluate the performance of the models under consideration in predicting genera counts. We turn to this matter in the next section.
3.2 Predicting Genera Counts
Now we turn to models trained to predict the genera count data given in Dryer (in prep). When we use the various feature systems to predict genera counts, we get the log likelihoods shown in Table 4:
Model  Log likelihood  d.f. 

Dryer (in prep)  −61.3  6 
Cysouw (2010)  −106.0  5 
Cinque (our features)  −67.7  8 
Cinque (Merlo’s features)  −71.1  8 
So when predicting genera, we get the best fit to the data using the set of features from Dryer (in prep), followed by Cinque’s (2005) features, followed by Cysouw’s (2010) features.
We think Cinque’s model comes out worse when predicting genera primarily because it underpredicts the most frequent order, whereas the Dryer model predicts that order exactly correctly. This can be seen in Figure 6, which shows model predictions, and Figure 7, which shows signed discrepancies relative to genera counts. Figures 8 and 9 show the optimal feature weights for the Dryer and Cinque models, respectively, when predicting genera counts.
4 Discussion
The results give clear evidence that the Dryer (in prep) and Cinque (2005) models provide feature systems with better predictive power than that of Cysouw (2010). But in our opinion they do not give strong reason to favor Cinque’s model over Dryer’s or vice versa. Although under one featurization Cinque’s model provides a slightly better fit to the adjusted frequency data, this does not hold under the other featurization, nor when predicting genera counts. The discrepancy between the adjusted frequency and genera count results may be due to the particular distributional characteristics of adjusted frequency discussed above. Overall, the analysis suggests that the Dryer model and the Cinque model have roughly similar predictive power, and the current data do not discriminate between them.
Acknowledgments
This work was supported by NSF DDRI grant #1551543 to R.F.
References
 Cinque, G. (2005). Deriving Greenberg’s Universal 20 and its exceptions. Linguistic Inquiry, 36(3):315–332.
 Cysouw, M. (2010). Dealing with diversity: Towards an explanation of NP-internal word order frequencies. Linguistic Typology, 14(2–3):253–286.
 Dryer, M. S. (2006). On Cinque on Greenberg’s Universal 20.
 Dryer, M. S. (in prep). On the order of demonstrative, numeral, adjective and noun.
 Kayne, R. S. (1994). The Antisymmetry of Syntax. MIT Press, Cambridge, MA.
 Merlo, P. (2015). Predicting word order universals. Journal of Language Modelling, 3(2):317–344.
 Smolensky, P. and Legendre, G. (2006). The Harmonic Mind. MIT Press, Cambridge, MA.
Appendix A Feature weights
Dryer (in prep) features, predicting adjusted frequencies:
Feature  Weight  Std. Error  p 

(Bias)  3.9815  0.1175  .001 
icon1  1.5382  0.1910  .001 
icon2  1.3726  0.2177  .001 
asym  1.7200  0.4846  .001 
harmony  1.0936  0.1436  .001 
nadj  0.7480  0.1641  .005 
Cysouw (2010) features, predicting adjusted frequencies:
Feature  Weight  Std. Error  p 

(Bias)  3.8649  0.1165  .001 
na_adjacency  1.3835  0.1917  .001 
n_edge  0.6402  0.1415  .001 
d_edge  1.1876  0.1561  .001 
na_order  1.1096  0.1584  .001 
Cinque (2005) features (our featurization), predicting adjusted frequencies:
Feature  Weight  Std. Error  p 

(Bias)  3.61092  0.16440  .001 
AlternativeMergeOrder  3.71628  0.37167  .001 
whose_pic_move  0.06587  0.26465  .803 
np_move_no_pp  1.39843  0.23706  .001 
pic_of_who_move  1.33916  0.28958  .001 
partial_move  0.30165  0.32682  .356 
np_extraction  1.52919  0.36381  .001 
total_move  0.18137  0.35649  .611 
Cinque (2005) features (Merlo’s featurization), predicting adjusted frequencies:
Feature  Weight  Std. Error  p 

(Bias)  3.39905  0.14848  .001 
AlternativeMergeOrder  3.51612  0.37858  .001 
partial_np  1.35184  0.23373  .001 
partial_ofwhopp  2.05580  0.35159  .001 
partial_whosepp  0.00645  0.17115  .970 
complete_np  1.17525  0.20835  .001 
complete_ofwhopp  1.16650  0.24160  .001 
complete_whosepp  0.29933  0.19247  .120 
Dryer (in prep) features, predicting genera counts:
Feature  Weight  Std. Error  p 

(Bias)  4.56458  0.09102  .001 
icon1  1.84600  0.17029  .001 
icon2  1.80865  0.20695  .001 
asym  1.77394  0.47645  .001 
harmony  1.25179  0.11740  .001 
nadj  0.86771  0.13439  .001 
Cysouw (2010) features, predicting genera counts:
Feature  Weight  Std. Error  p 

(Bias)  4.41097  0.09217  .001 
na_adjacency  1.77523  0.17740  .001 
n_edge  0.76628  0.11678  .001 
d_edge  1.24504  0.13132  .001 
na_order  1.22424  0.13152  .001 
Cinque (2005) features (our featurization), predicting genera counts:
Feature  Weight  Std. Error  p 

(Bias)  4.0431  0.1325  .001 
AlternativeMergeOrder  4.1484  0.3586  .001 
whose_pic_move  0.2471  0.2335  .290 
np_move_no_pp  1.9358  0.2081  .001 
pic_of_who_move  1.3523  0.2527  .001 
np_extraction  1.9767  0.3203  .001 
partial_move  0.2081  0.2835  .463 
total_move  0.5786  0.2956  .050 
Cinque (2005) features (Merlo’s featurization), predicting genera counts:
Feature  Weight  Std. Error  p 

AlternativeMergeOrder  3.92246  0.37106  .001 
partial_np  1.79085  0.21596  .001 
partial_ofwhopp  2.40130  0.31357  .001 
partial_whosepp  0.03155  0.14208  .824 
complete_np  1.35753  0.18655  .001 
complete_ofwhopp  0.98344  0.18786  .001 
complete_whosepp  0.52716  0.15617  .001 