LINGUIST List 9.1476

Thu Oct 22 1998

Sum: Addendum to GoldVarb Summary

Editor for this issue: Scott Fults <>


  1. Robert Sigley, GoldVarb (addendum)

Message 1: GoldVarb (addendum)

Date: Thu, 22 Oct 1998 15:39:17 +0300
From: Robert Sigley <>
Subject: GoldVarb (addendum)

In writing to Mario, I referred to (and included a copy of) Ch.7 of my
PhD thesis [Sigley, R. 1997. Choosing Your Relatives: Relative Clauses
in New Zealand English. PhD thesis, Victoria University of Wellington,
New Zealand.] This chapter compares logistic/Varbrul analysis with
more ordinary chi-squared tests on crosstabulated data; it's intended
as a practical guide to interpreting the GoldVarb output.

My email to Mario was a summary of that material, with additional
speculations, one of which was certainly wrong as stated (see below).
I write now so that anyone wishing to discuss details with me can do so
directly by email.

(i) The number of degrees of freedom in a logistic or loglinear model =
(the number of independently estimated parameters) - (the number of fixed
parameters).

Question: Is this equal to (number of factors) - (number of factor groups),
as Avila states, or to (number of factors + 1) - (number of factor groups)?
In other words, does the 'input weight' (which is also iteratively
estimated) count?
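As a toy illustration of the two candidate counts, here is a short sketch in Python. The factor-group sizes are invented for illustration, not taken from any real GoldVarb run:

```python
# Hypothetical factor-group sizes: three groups containing
# 3, 2 and 4 factors respectively (invented for illustration).
group_sizes = [3, 2, 4]

n_factors = sum(group_sizes)   # total number of factors: 9
n_groups = len(group_sizes)    # number of factor groups: 3

# Avila's count: (number of factors) - (number of factor groups)
params_without_input = n_factors - n_groups        # 6

# Alternative count, which also treats the iteratively estimated
# input weight as one further parameter
params_with_input = (n_factors + 1) - n_groups     # 7

print(params_without_input, params_with_input)     # -> 6 7
```

The two conventions always differ by exactly one, so the choice matters whenever a significance test depends on the parameter count.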

(ii) The comment I made in parentheses below is inaccurate.

>It is possible to use this method to incorporate several interaction
>effects into the model -- but it quickly becomes rather cumbersome, as you
>will often have to collapse distinctions in order to include the
>crossproduct factor group, and things get really messy when you need to
>consider several interactions involving the same factor group. (I think the
>best way to treat these is stepwise: if the most significant interaction is
>between groups 1 and 2, and you suspect there's also an interaction between
>groups 1 and 3, you can only approach it indirectly by comparing models
>containing 1*2, 3, 4,...n and 1*2*3, 4,...n. By contrast, if you try
>constructing a model containing 1*2, 1*3, 4,...n then you've effectively
>encoded the distinctions from group 1 twice, which means your model has
>redundant parameters and could produce unreliable results.)

Here I was trying to reconcile differences between what I know in theory
and what seems to work in practice, and managed a rather garbled account; a
fuller explanation follows.

Suppose we're comparing the models:

(a) 1*2, 3, 4, ... , n (a model containing the interaction effect between
groups 1 and 2, but treating every other factor group as independent)

(b) 1*2, 1*3, 4, ... , n (a model containing independent interactions
between groups 1 and 2, and groups 1 and 3)

(c) 1*2, 1*3, 2*3, 4, ... , n (containing independent 2-way interactions
for groups 1 and 2, 1 and 3, 2 and 3)

(d) 1*2*3, 4, ... , n (containing the 3-way interaction for groups 1, 2 and 3)

In theory: 

To test the significance of adding the 1*3 interaction to a model
containing the 1*2 interaction, you should compare models (a) and (b).

To test the significance of further adding the 2*3 interaction, you should
compare models (b) and (c).

To test the significance of the 3-way 1*2*3 interaction, you should compare
models (c) and (d).

These models show increasing complexity, and an increasing number of
independently-estimated parameters, from (a) < (b) < (c) < (d).
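In code, each of these comparisons amounts to a likelihood-ratio test. The sketch below (Python; the log-likelihoods and parameter counts are invented, not real GoldVarb output) shows the arithmetic for the (a)-versus-(b) comparison:

```python
# Invented log-likelihoods, as GoldVarb would report for two nested models.
loglik = {
    'a': -520.4,   # model (a): 1*2, 3, 4, ..., n
    'b': -517.1,   # model (b): 1*2, 1*3, 4, ..., n
}
params = {'a': 10, 'b': 13}   # independently estimated parameters (invented)

# Likelihood-ratio statistic: G2 = 2 * (LL_complex - LL_simple),
# compared against chi-squared with df = difference in parameter counts.
g2 = 2 * (loglik['b'] - loglik['a'])
df = params['b'] - params['a']

critical = 7.815   # chi-squared critical value for df = 3, p = 0.05
print(round(g2, 1), df, g2 > critical)   # -> 6.6 3 False
```

With these invented figures the improvement (G2 = 6.6 on 3 df) falls short of the 5% critical value, so the 1*3 interaction would not be retained.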

In practice: this doesn't always work, for several reasons.

* Crossproducts often contain many apparently categorical environments
 ('knockouts') -- mostly because of low cell occupancy, but also because
 of systematic gaps -- which must be excluded or collapsed for analysis.
 Performing these simplifications sometimes produces nonsensical results.
 I've often found that a model containing a 3-way interaction contains
 *fewer* independently-estimated parameters than the supposedly
 'simpler' model containing the 3 2-way interactions -- once
 knockouts are excluded. Thus *in some cases* you won't be able to use
 the recommended model test, and some more indirect approach will be
 needed.

* Crossproducts often contain a large number of factors. This may mean that
 the overall model has a higher number of parameters than is justified by
 the number of tokens in the dataset. Thus, accidental redundancy (where
 several combinations of factors describe the same set of tokens) may
 result. This is particularly likely when you include two factor groups
 based partly on the same distinctions (e.g. the 1*2, 1*3 crossproducts,
 which will both partition the dataset along the divisions from the
 original group 1). I must emphasise that including such crossproducts of
 shared factor groups does not necessarily result in redundancy (in contrast
 to what my original statement implied) -- but it does make it more likely.
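To illustrate the knockout problem for crossproduct groups, here is a small sketch (Python; the tokens are invented) that builds a 1*2 crossproduct factor group and flags the categorical cells that would appear as knockouts:

```python
from collections import defaultdict

# Invented tokens: (group 1 factor, group 2 factor, binary outcome).
tokens = [
    ('a', 'x', 1), ('a', 'x', 0), ('a', 'y', 1),
    ('b', 'x', 0), ('b', 'x', 0), ('b', 'y', 1), ('b', 'y', 0),
]

# Build the 1*2 crossproduct group and tally outcomes per combined factor.
cells = defaultdict(lambda: [0, 0])   # cell -> [failures, successes]
for g1, g2, outcome in tokens:
    cells[g1 + g2][outcome] += 1

# A knockout is a cell where one outcome never occurs; such categorical
# environments must be excluded or collapsed before fitting the model.
knockouts = sorted(k for k, (n0, n1) in cells.items() if n0 == 0 or n1 == 0)
print(knockouts)   # -> ['ay', 'bx']
```

Even in this tiny invented dataset, two of the four crossproduct cells are categorical, which is why crossproduct groups so often force collapsing of distinctions before analysis.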

 Robert Sigley.
| Robert Sigley, Foreign Languages Dept |
| (English Division), Daito Bunka University, |
| 1-9-1 Takashimadaira, Itabashi-ku, Tokyo 175 |
