Technical Note: Bells and Smells and Replication

How to replicate the analyses in “Bells and smells…”

Replication is very much in vogue in psychology these days. To be honest, the field’s preoccupation with methodological navel gazing strikes me as being akin to worrying about the weeds on your lawn as your house burns down: We homo sapiens are defined by our capacity for learning, language and culture, and the fact that most researchers in the brain and cognitive sciences are poorly trained in the very things that serve to put the sapiens in homo seems like a far more pressing concern.

Still, I wouldn’t want to say that lawn weeds aren’t bothersome, or that replication is not important to the scientific process, and so in the spirit of supporting recent efforts to raise standards in the field, I decided to see if I could replicate the findings described in my “Bells and smells” post using word frequency counts taken from sources other than those used in the original.

(The data for those analyses are here, along with some helpful R code; the empirical data can be found here.)

In this replication, I estimated the background rates and blocking using frequencies taken from the 450 million word “Corpus of Contemporary American English” (COCA). These are therefore current frequency counts rather than counts that reflect usage contemporaneous with the study itself. To estimate co-occurrence rates, I used the number of hits returned by a literal “w1 w2” search on the Bing search engine as a rough measure of frequency (as you might imagine, it is difficult to estimate frequencies for pairs like “jury” and “eagle”).

Word frequency counts are not normally distributed. Indeed, they are notoriously skewed, such that when counts are plotted from highest to lowest, the frequency counts for a few of the most common items will dwarf those of all the other items. This property can clearly be seen in this plot of the raw frequencies for the first words in each of the PAL pairs (i.e., the number of times word1 appears among the 450 million words in COCA):


Statistical tests like regression behave best when the variables entered into them are roughly normally distributed. (Strictly speaking, it is the residuals that are assumed to be normal, but heavily skewed predictors let a handful of extreme items dominate the fit.) In their raw form, word frequencies are a poor match for these tests. However, if we apply a logarithmic transformation to our word frequency counts, we can make them approximate something more like a normal distribution:


(As you can see from the lowest frequency items, this distribution is still not perfectly normal — to explain word frequency distributions and the challenges they pose to analyses is a book in itself — but this will suffice for now.)
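For readers who want to see the effect of the transformation for themselves, here is a minimal sketch in Python. The counts are invented for illustration (they are not the COCA figures), and the `skewness` helper and the add-one step before taking logs are my own assumptions rather than part of the original analysis.

```python
import math
import statistics


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical raw counts, NOT the COCA figures: a few very frequent
# words followed by a long tail of rare ones.
raw_counts = [2_500_000, 410_000, 96_000, 31_000, 12_000,
              4_800, 1_900, 650, 210, 40]

# Adding 1 before taking logs is an assumption made here so that a
# zero count would not break the transform.
log_counts = [math.log(c + 1) for c in raw_counts]

raw_skew = skewness(raw_counts)
log_skew = skewness(log_counts)

print(f"skewness of raw counts: {raw_skew:.2f}")
print(f"skewness of log counts: {log_skew:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution, so the second figure should come out much closer to zero than the first.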

The log-transformed frequency counts can then be entered into regressions to see how well they predict the item variance in the PAL data.

Using the COCA / Bing counts, all of the effects I reported in my “Bells and smells” post replicated (the R-squared values for both of the regression analyses were around .77).
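The regression step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the analysis itself: the item values are made up, the predictor names `word_count` and `pair_hits` are hypothetical stand-ins, and the small least-squares solver is there only to keep the example free of external libraries. The R code linked earlier remains the reference for the real analysis.

```python
import math


def ols_fit(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of predictor rows; an intercept column is added here.
    Returns [intercept, slope_1, slope_2, ...].
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(X, y, coefs):
    """Proportion of variance in y accounted for by the fitted model."""
    preds = [coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))
             for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical items: (word_count, pair_hits, recall score).
# These numbers are invented and are NOT the COCA / Bing data.
items = [
    (120_000, 90_000, 0.95),
    (45_000, 30_000, 0.85),
    (300_000, 1_200, 0.40),
    (8_000, 400, 0.35),
    (60_000, 15_000, 0.75),
    (2_500, 2_000, 0.70),
    (900_000, 5_000, 0.45),
    (15_000, 150, 0.25),
]

# Log-transform both frequency predictors before fitting.
X = [[math.log(word_count), math.log(pair_hits)]
     for word_count, pair_hits, _ in items]
y = [score for _, _, score in items]

coefs = ols_fit(X, y)
r2 = r_squared(X, y, coefs)
print("coefficients:", [round(c, 3) for c in coefs])
print("R-squared:", round(r2, 3))
```

With real data, the same two steps apply: take logs of the counts, fit the regression, and read off the R-squared value.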

The new data used in the replication:

[Image: table of the new data used in the replication]

N.B. In gathering these data, I discovered that the counts on Bing had already been affected by the publication of our papers. (Language really is a living thing, which makes studying it harder.) I’ve tried to minimize the effect of my blog on these counts, hence the lack of text in the table above.