What happens to our minds and memories in healthy ageing?

In our recent paper, The myth of cognitive decline, my colleagues and I suggest that the answer to this question is, “it’s complicated.” And if you think that the answer involves a steady deterioration of cognitive function, we present a series of findings that may make you think again.

Take, for instance, our ability to “retrieve” words from our memories: It’s widely believed that this ability declines as we get older. However, when we took a close look at the tests used to measure memory performance across the lifespan, we found that the truth of the matter is far less simple. Whether a word becomes harder (or easier) to recall with age can depend on both the kind of word tested and the kind of test used. While many people find names increasingly hard to recall as they get older, on some tests of word recall, memory retrieval is unaffected by ageing. On other tests, performance actually improves with age. (I’ll talk about this in more detail in a later post, but for now, Mark Liberman over at Language Log offers a lucid introduction to the way retrieval performance varies by test.)

Not only did we find that a researcher’s choice of test can determine whether cognitive functioning appears to decline or improve with age, we also found that the results of the same cognitive test can suggest age-related declines or improvements, simply as a result of the context in which people are tested.

The methods we use to establish our findings are fairly complicated. For our journal article, we assumed that our readers would have a fairly high degree of scientific training, and so we kept the presentation of our methods and results brief. Because our results are likely to be of interest to people who have not had that training, I thought it might be helpful to write some expanded introductions to our work, so that our findings and their implications can be more widely understood.

Today I’m going to focus on the result that I personally found most surprising, namely that, if we take the scientific literature on aging at face value, then, if I were to test your ability to generate as many words beginning with F as you can in a minute, the number of words you actually manage to produce will depend on how many other people I intend to test. If I am intending to test a lot of other folks, you will manage to generate fewer F words than if I only intend to test a small number of people.

Unless you believe in E.S.P., there may seem to be something wrong with this result. How on earth can someone possibly know how many other people a given researcher intends to test? And if the person being tested can’t know – and I hope you’ll agree, unless she is told, it seems unlikely that anyone sitting down with a researcher to generate their “F” words is going to know – then how can the number of words that person generates depend on the number of other people the researcher intends to test?

The answers to these questions have some worrying implications for our current understanding of healthy cognitive ageing. But before we can start to think about those implications, it’s important we understand both the task itself, and the strange patterns of results arising from it in the current scientific literature.

The task

Like most standard tests of “cognitive ability,” the FAS task is remarkably simple. People are asked to generate as many words beginning with F as they can in 60 seconds, followed by as many words beginning with A in 60 seconds, followed by as many words that begin with S.

A couple of rules govern the words that are allowed as responses in the test: Proper names like Steve or France are not allowed, nor are different versions of the “same” word, such as friends, friendly, etc. (Also, sometimes the letters C R and F are used instead of F A and S. For the present purposes, I’m going to ignore this.)
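To make the scoring rules concrete, here is a minimal sketch of how a list of responses might be scored. It is only an illustration, not the procedure used in any of the studies discussed here: the function names, the list of proper names, and the crude suffix-stripping used to spot different versions of the “same” word are all my own assumptions.

```python
# Hypothetical scoring sketch for one letter of the FAS task.
# Assumption: "same word" variants are detected by stripping a few
# common suffixes, which is far cruder than a human scorer's judgement.
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def score_fas(responses, letter, proper_names=frozenset()):
    """Count responses that start with `letter`, are not proper names,
    and are not variants of a word that has already been counted."""
    seen_stems = set()
    score = 0
    for raw in responses:
        word = raw.strip().lower()
        if not word.startswith(letter.lower()):
            continue  # wrong starting letter
        if word in proper_names:
            continue  # proper names are not allowed
        stem = crude_stem(word)
        if stem in seen_stems:
            continue  # a version of this word was already counted
        seen_stems.add(stem)
        score += 1
    return score


responses = ["friends", "friendly", "France", "fish", "apple", "fog"]
print(score_fas(responses, "F", proper_names={"france"}))  # prints 3
```

Here “friends”, “fish” and “fog” count, while “friendly” (a version of a word already given), “France” (a proper name) and “apple” (wrong letter) do not.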

How does performance on the FAS task change across adulthood?

While the task could hardly be any simpler, the results of scientific studies that have used this test are complex.

In the published literature, people tend to agree that two things seem to influence how many words people are able to generate:

1. Education. This is straightforward: the more education someone has, the more F, A and S words they are able to produce.

2. Less straightforwardly, Age.

Why is age less straightforward? Well, when researchers test just a few people, those people seem to do better at this task the older they are. For example, in a recent experimental study of Scrabble expertise, a group of 23 older adults with an average age of 57 produced around 25% more F A and S words than a group of 23 nineteen-year-olds. (I should add that these 57-year-olds were not the experts – the 23 Scrabble masters, also aged 57, produced nearly twice as many words as the 19-year-olds!)

As I’ve described them so far, these two results seem pretty consistent with one another: the more words you know, whether from explicit study during education, or simply from living longer, the better you do at the FAS task. (And, as we explain in our article, and as I’ll explain at length at a later date, in English the relationship between the “legal” FAS words and the disallowed words (like proper nouns) not only predicts this, it also accurately predicts just how difficult the F A and S parts of the test ought to be.)

So when I came across a study looking at the results of 134 published scientific studies of ageing in which the FAS task was used, I was surprised to see that on average, these studies appeared to show that scores go down with age!

Why the contradiction?

To explain why these contradictory findings come about, I need to introduce you to a nifty data visualization tool, called a generalized additive mixed model (or GAMM), that my colleagues and I use in our article. GAMMs provide a neat way of plotting complex data, and they allow us to more easily see the relationships in data in more than two dimensions.

In the following GAMM visualisation (a version of which appears in our paper), I’ve taken each of the results from the 134 studies and plotted them by age and number of people tested, also known as the sample size. (Because the data are skewed towards either very big or very small samples, I’ve log transformed them to make the relations between the points easier to see.) Each of the points (circles) represents a study, and the average age tested in each study is plotted on the Y (vertical) axis. I’ve drawn lines on the graph so you can more easily see what is happening at ages 21, 30, 65 and 80: if you look along the “65” line, the points close to it represent studies in which the average age of the people tested was around 65 years old.


Sample size is plotted on the horizontal (X) axis. I’ve added some vertical lines to the next chart that allow you to see what is happening when smaller and larger numbers of subjects are run in a study. If you look up and down the line marked “20,” each of the points near to it represents a study in which 20 or so people were tested.


When you look at the points on the chart, the higher a point is, the older the average ages of the people tested were. The further left a point is, the smaller the sample of people tested was, and the further right a point is, the bigger the sample of people tested was.
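If the log transformation of the horizontal axis seems mysterious, the following small sketch may help. It assumes a natural logarithm (any base gives the same picture up to scaling), and the sample sizes are arbitrary illustrative values rather than figures from the 134 studies.

```python
import math


def log_positions(sample_sizes):
    """Return where each sample size would sit on a log-scaled axis."""
    return [math.log(n) for n in sample_sizes]


# Arbitrary illustrative sample sizes, not taken from the studies.
sizes = [10, 20, 40, 1000]
positions = log_positions(sizes)

# Distance between neighbouring points on the log-scaled axis.
gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
for (small, large), gap in zip(zip(sizes, sizes[1:]), gaps):
    print(f"{small} -> {large}: {gap:.2f}")
```

On the log scale, each doubling (10 to 20, 20 to 40) takes up the same amount of space, and the jump from 40 to 1000 no longer squashes all of the small studies into one corner of the chart.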

Now we can turn our attention to the funky coloring of the chart. The color map represents a moving average of the number of words each group produced, depending on its size and average age. (The map also controls for average education in these studies, because, as I mentioned above, education is the best predictor of FAS scores, and we don’t want differences in education messing up our understanding of the relationship between age, sample size, and FAS scores.)


If you look at the plot above, the pink color in the middle area of the chart picks out the area where people tested in studies produced the highest number of FAS words on average. As the pink shades to orange and then to yellow and finally to green, the changes in color represent a decrease in the average FAS score in each study. In terms of FAS scores, pink > orange > yellow > green, with the red lines illustrating the way the average scores change according to their relationship to age and sample size.

(A good way of thinking about the visualisation is to treat it like a map of a mountainous area, in which scores are elevations. The pink part of the plot is a high, dry plateau. As the elevations drop, the colors of the vegetation change, and at the lowest elevation, in the valley, the colors shift to green.)
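For readers who like to see the idea of a “moving average over a map” in code, here is a rough sketch. To be clear, this is not a GAMM and not the analysis from our paper: it swaps in a simple Gaussian kernel-weighted average, ignores education entirely, and the function name and bandwidth values are arbitrary choices made purely for illustration.

```python
import math


def smoothed_score(studies, log_n, age, bw_log_n=0.5, bw_age=10.0):
    """Kernel-weighted mean score at the map location (log_n, age).

    `studies` is a list of (sample_size, mean_age, mean_score) tuples.
    Studies close to the location get large weights; distant ones
    contribute almost nothing.
    """
    weight_sum = 0.0
    weighted_scores = 0.0
    for n, mean_age, score in studies:
        dx = (math.log(n) - log_n) / bw_log_n  # distance along the X axis
        dy = (mean_age - age) / bw_age         # distance along the Y axis
        weight = math.exp(-0.5 * (dx * dx + dy * dy))
        weight_sum += weight
        weighted_scores += weight * score
    if weight_sum == 0.0:
        return float("nan")  # no study close enough to say anything
    return weighted_scores / weight_sum
```

Evaluating a function like this at every point on a grid of sample sizes and ages gives the “elevation” at each spot on the map, which can then be colored in.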

Now that I’ve explained the layout of the plot, let me point out a few of the interesting things that it shows.


In the plot above, I’ve highlighted the relationship between age, sample size, and FAS scores in studies that tested smaller numbers of people. As you can see, in studies where relatively small groups of adults are asked to produce F A and S words, their average scores improve up to around age 30 (the points in the blue circle), and then performance changes very little, even after age 65 (the points in the red circle).

To further test this observation, I also looked at only those studies in which 40 or fewer people were tested, and where the average age of the people tested was either 35-or-less or else 65-and-older, using a more conventional averaging method.

Here’s the average performance in each age group:


As you can see, the FAS scores of the 35-or-less and 65-or-older groups are virtually identical, even though this plot doesn’t control for education (more education predicts better FAS performance, and the average level of education declines with age across these studies). There is no evidence of decline.
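The selection and averaging just described can be sketched in a few lines. I am assuming here that “conventional averaging” means a plain unweighted mean of the study averages; the dictionary keys and the numbers in the example are made-up placeholders, not data from the 134 studies.

```python
from statistics import mean


def group_means(studies, max_n=40, young_max=35, old_min=65):
    """Average study-level scores for small studies of young and old adults.

    Each study is a dict with hypothetical keys "n" (sample size),
    "age" (average age tested) and "score" (average FAS score).
    """
    small = [s for s in studies if s["n"] <= max_n]
    young = [s["score"] for s in small if s["age"] <= young_max]
    old = [s["score"] for s in small if s["age"] >= old_min]
    return {
        "35_or_less": mean(young) if young else None,
        "65_or_older": mean(old) if old else None,
    }


# Made-up placeholder values, for illustration only.
example_studies = [
    {"n": 20, "age": 25, "score": 40.0},
    {"n": 30, "age": 30, "score": 44.0},
    {"n": 25, "age": 70, "score": 42.0},
    {"n": 500, "age": 72, "score": 30.0},  # too large: excluded
    {"n": 35, "age": 50, "score": 45.0},   # middle-aged: excluded
]
print(group_means(example_studies))
```

Note that the large study and the middle-aged study never enter either average, which is exactly what restricting the comparison to small samples of younger and older adults means.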


Finally, the plot above highlights what happens as sample sizes grow. (Note: the points outside the colored area represent a cluster of studies for which education data was not available.)

There are several things to note here:

1. The majority of the studies that tested slightly larger numbers of people have targeted older adults rather than younger adults.

2. All of the studies involving very large numbers of people targeted older adults.

3. In studies of older adults, the average number of F A and S words produced by each person declines as more and more people are tested.

4. Where slightly larger samples of younger adults were tested, the opposite appears to be true (though there are too few data points to be sure what to make of this).

Or in other words, if we take the published literature on aging at face value, then if you are an older adult, and I ask you to generate as many words beginning with F as you can in a minute, the number of words you will manage to produce will depend on how many other people I also intend to test. If by magic you know that I intend to test a lot of other people, you will only manage a few words, whereas if I only intend to test a few other people, you will manage a lot more!

Assuming we don’t believe in magic, this raises a question: How can this be?

The honest answer is that we really don’t know for sure. What I can tell you for certain is that when we compared the effect of sample size to age across these 134 studies, the number of people tested in a given study had a stronger influence on FAS scores than age did. (I should also add that the analysis of these studies which suggested that FAS performance is subject to age-related declines did not look at the relationship between sample size and test performance.)

In the paper, my colleagues and I offer a few speculations as to why performance and sample size might be related, but they are just that: speculation. Before I get to them, I want to make three things clear.

First, the goal of the studies I’m discussing here is to try to get a handle on healthy cognitive ageing. These studies set out to study minds that are not affected by any of the diseases, such as Alzheimer’s, that do damage to our cognitive capacities regardless of age. In the light of this goal, many of these studies explicitly tried to screen the people they tested for symptoms of these diseases.

Second, all of the studies assume that the number of words that each person produces on the FAS task is a measure of their underlying cognitive performance. That is, they assume that the number of FAS words an adult can produce serves as an objective index of that person’s memory retrieval processes.

Third, they assume that an adult’s performance on the FAS task will not be affected by the extra knowledge that comes with age and experience (or else they assume that after 20, or thereabouts, adults gain little in the way of extra knowledge; I’ll return to this dubious assumption in more detail in a later post).

What the results of these 134 published studies appear to show is that all of these assumptions are unwise.

To reiterate, the FAS task is a ridiculously simple measure: All it asks of the people it tests is that they retrieve some words from their memories. It seems clear from these 134 results that the number of words people generate is sensitive to the context of testing, and that people’s sensitivity to testing contexts changes with age. In the light of this, it follows that, taken alone, FAS task scores cannot be considered to be an objective measure of memory retrieval processes over the adult lifespan.

It also follows that unless we can isolate and control for the effects of testing (which these studies so clearly demonstrate), then the FAS task can tell us little about how cognition changes over time in a healthy mind. That’s because, in short, we are unable to tell whether differences in FAS performance reflect changes in people’s cognitive processes, or whether they reflect the way that people’s sensitivity to testing contexts changes over time. (If we extrapolate from the pattern of data we see here, it appears we could actually rig our answer to the question of whether people get better or worse at the FAS task as they age simply by studying larger or smaller groups of people.)

This is a cause for concern. Almost every idea we have about the way the cognitive capacities of healthy minds develop over a lifespan is based on tests like the FAS task, and on the results of studies like the 134 that we have looked at here. If these tests are flawed (or to be more precise, if our understanding of what these tests measure is flawed), then our current understanding of healthy cognitive ageing is likely to be flawed as well.

Ok, so how might context influence FAS scores?

To reiterate: I really don’t know.

Here are some aspects of the way studies are currently run that I think may be a cause for concern.

Human beings are social animals. It seems likely that where they are tested and who they are tested by will influence their performance. Will an older adult tested in a research laboratory at a University feel as comfortable as a college student tested in the same environment? If not, then a concern arises as to whether a test like the FAS task is measuring differences in social anxiety rather than differences in cognition.

Similarly, the social reality of science means that most of the people who participated in the studies reviewed here were not tested by the doctors and professors who actually wrote up the results. Most likely they were tested by researchers who were far closer to age 20 than age 70. Again, it does not seem too outlandish to suppose that people might feel more comfortable if they are tested by someone with whom they have more affinity, or that people will do better on tests when they feel more comfortable.

The people tested in studies are not the only people we have to worry about when it comes to context effects. As I noted above, many studies use exclusion criteria to try to avoid including adults with undiagnosed diseases in their samples of “healthily aging adults.” Recruiting people to take part in studies takes time, as does actually testing them. Do we really believe that exclusion criteria will be applied identically to subject 817 in a sample of 1000 and subject 17 in a sample of 20? If we suspect that the answer is no, then some part of the “decline” seen as test samples grow may reflect undiagnosed diseases, such as Alzheimer’s or dementia, further muddying our understanding of “healthy aging.”

And of course, testing is itself a human endeavor. While it might be feasible to have a single, well-chosen, well-trained researcher test a sample of 20 younger and 20 older adults in a study, and to hope for some uniformity of testing, when we increase the sample to 800 adults, this becomes less likely. Once again, this is likely to matter. (Interestingly, a great many scientists believe that when it comes to studying the mind, sample sizes can never be too large. Before I did these analyses, I thought this myself. I now doubt that this is always the case.)

There are probably many other things one could worry about along these lines. It is unlikely that psychological tests will ever have the objective nature of litmus tests. Given that the current state-of-the-art in testing cognition is light years away from a litmus test, the uncritical approach to the output of “cognitive tests” in both scientific and popular reports of studies of ageing is a further cause for concern.

In our paper, we try to show how we can use measures of the information in tasks, and the way this information changes both over time and with experience, to get a better handle on the way that we might expect people to perform in tasks as a function of their age. For example, by studying large samples of language, we can get an idea of how we might expect the performance of an idealized healthy adult to change on the FAS task over time.

In future posts, as well as explaining other aspects of our work in more detail, I’ll also try to show how that works.

Ramscar M, Hendrix P, Shaoul C, Milin P, & Baayen H (2014). The myth of cognitive decline: non-linear dynamics of lifelong learning. Topics in Cognitive Science, 6 (1), 5-42 PMID: 24421073