Sticks and stones (2): How names work

A stranger in the village

When people migrate to Sweden, they are given the option of exchanging their current last name for one that sounds a little more Swedish. The process is administered by the Patent- och registreringsverket – the Patent and Registration Office (PRV) – and of the many rules it enforces, an important one requires anyone wanting to adopt an existing Swedish surname to prove their descent from at least two generations of people with that name who lived in the past 100 years. This, of course, is not something most immigrants can do. Instead they must come up with a new name, and then apply to the PRV for permission to use it.

To help with the process of inventing new Swedish-sounding names, the PRV offers detailed guidance. This must come as a relief to anyone going down this path, because Section 12, paragraph 1 of the Names Act (1982) states that any candidate name that “in composition, pronunciation or spelling has such a linguistic form that it is not appropriate as a surname in this country” will be summarily rejected.

Although you might wonder why anyone would possibly go to all the trouble, many newcomers to Sweden clearly feel it’s worth it, and large numbers of them submit to the process every year. This in turn has created a kind of natural experiment, because it allows the fortunes of immigrants who adopt these new Swedish-sounding surnames to be compared to those of immigrants who choose to stick with their existing, foreign surnames. And when economists Mahmood Arai and Peter Skogman Thoursie ran the numbers from this experiment, they found that immigrants who adopted Swedish-sounding names earned 12 to 44% more in salary than immigrants who didn’t (depending on how exactly the difference was calculated).

Clearly there is more going on here than just the fact of how different names sound. Yet this finding again makes clear that people’s names can have an important impact on their lives. It also helps highlight the fact that familiarity – and frequency – may not be as simple as they first appear. In my last post, I described how the favorability with which people view a first name is closely correlated with its frequency: how often names occur in discourse (or how often people think they do). The new names immigrants adopt in Sweden clearly have low objective frequencies, since the PRV’s rules make it likely they are unique, yet it seems likely that these rules also make these names seem more familiar – and frequent – than they actually are, simply because they force them to fit into the pattern of existing names. This highlights the fact that there are more ways of measuring the information conveyed by names than simply counting frequency, a point I will return to in a future post.

For now, however, I’m going to stick with frequencies. This is not because other ways of measuring informativity aren’t important – far from it, hence my highlighting this point – but rather because, as I’ll show, we can learn a lot about names even while ignoring the finer details of their informativity. What is more important for my present purposes is that I firm up this last notion, and make clear what – exactly – the “information” in names is. And since names are clearly part of a communication system – language – Claude Shannon’s Mathematical Theory of Communication (what is now called “information theory”) seems like an obvious place to start. The theory defines information precisely and objectively – it is the foundation of our information age – yet, as Shannon himself pointed out, its precision often gets lost in translation. When other fields talk about the theory, vaguer, more colloquial understandings of “information” often seep back in, and this vagueness problem seems especially acute with regard to language. So, paying heed to Shannon’s warning, I shall now lay out explicitly what I mean by “information,” and how this applies to names.

Names, information and communication

Information theory was originally developed to solve a very specific problem: the accurate transmission of signal sequences over communication channels, such as electronic pulses over copper wires. Communication channels are prone to distorting signals, and what was needed was a method for sending signals in such a way that a receiver on the other end could separate out this distortion and recover the original message. Shannon’s genius was to see that the way to solve the problem was by considering the communication system as a whole, rather than by focusing on any one part of it. Accordingly, he redefined the question of coding and decoding signals in terms of discrimination: If the sender and the receiver are both aware of the scope of a set of messages (or code words) that can be transmitted across a system – that is, if they both know the same source code – then the signaling problem can be re-envisaged as one of discriminating the legal signals (which are defined by the code) from any illegal signals.

We can think about names in terms of Shannon’s systems-level approach by considering a set of individual identities and their corresponding names as providing a source code. The individual identities are the messages that people wish to communicate about, and spoken names serve as signals that, ideally, allow anyone hearing them to discriminate the intended identity without getting confused along the way.

Thinking about names in coding terms also helps to highlight a unique communication problem they pose. For most other things we signal using words, we are happy to use generic terms like dog, cat, idea, information. However, when it comes to names, we want them to be sui generis. Think about names like John F. Kennedy, or Geoffrey of Wenhaston, or Kim Jong-il: Ideally, we would like a name to uniquely identify an individual, such that after hearing it, we will have discriminated the individual being identified from the set of all other possible identities. This is rather different from discriminating pots from pans, and it means that as compared to pots and pans, we will need a lot of names. And it also means that it would be best if names were somehow arranged so as to be easy to process and remember.

Since people often complain that names are anything but easy to remember, you may be forgiven for thinking at this point, “Thanks and so much for the theory!” But in fact, it turns out that information theory not only provides a nice way of understanding names, it can also show just how remarkably well-adapted traditional naming practices are to the job. And because information theory is quantitative, it can also allow me to show you just how much better actual native naming practices are as compared to the possible alternatives.

To show you what I mean, I’ll need some data. The following table, compiled by names researcher Alice Crook, summarizes the records for 7,035 babies, comprising 3,561 males and 3,474 females, baptized in the parish of Beith, in North Ayrshire, Scotland in the period 1701‒1800. After removing names that were illegible, Crook was able to analyze 6,903 records. These are the rates at which first names were given to these babies:

[Table: first names given to babies baptized in Beith, 1701-1800]

As you can see, over 90% of the boys were given 1 of just 10 male names, and over 90% of the girls were given 1 of just 10 female names. What’s also apparent is that both of these name distributions are far from random; rather, each exhibits a similarly skewed and systematic pattern. To understand why this is significant, you’ll need to know a little more about the way Shannon defined information (or, ‘entropy’).

In devising his solution to the communication problem, Shannon had to figure out how to make the signaling process both efficient and accurate. It was here that his genius came to the fore: If, in a communication system, the sender and receiver both share a common source code, then both the scope of possible signals, and the receiver’s moment-to-moment uncertainty as the signal is transmitted, can be defined in advance. It is this uncertainty that Shannon used to formally define “information.”

Under Shannon’s definition of information, the less expected something is, the more information it provides. This can seem a somewhat counterintuitive idea at first. Yet it can be explained in a fairly intuitive way if we return to the babies of Beith: If a naming system defines a code in which around 25% of boys are called John, but only 5% are called Hugh (as indeed was the case in Beith), then it follows that whenever a name occurs in conversation, it is more likely that it will turn out to be John rather than Hugh. Which means that when it comes to the uncertainty associated with a name, Beithers should have implicitly treated Hugh as being more uncertain than John. It also follows that when names are coded in this way, a listener hearing Hugh will inevitably be able to narrow down the set of possible identities more than when they hear John. And given that this means that once a listener has heard Hugh, she will be able to guess a signaled identity better than had she heard John, it seems natural to say that Hugh provides more information about the identity being signaled than John does.
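
To put rough numbers on this, the information carried by a single outcome (often called its surprisal) is the negative base-2 logarithm of its probability. The sketch below is purely illustrative: the 25% and 5% figures are the approximate rates quoted above rather than exact counts, and the function name is a hypothetical one chosen for this example.

```python
import math

def surprisal_bits(probability):
    """Information, in bits, conveyed by an outcome with the given probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Approximate rates quoted in the text, not exact counts from the Beith records.
john = surprisal_bits(0.25)   # 2 bits
hugh = surprisal_bits(0.05)   # a little over 4.3 bits
print(f"John: {john:.2f} bits, Hugh: {hugh:.2f} bits")
```

On these figures, hearing Hugh conveys roughly twice as many bits as hearing John, which is the sense in which the rarer name is the more informative one.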

Accordingly, a subtle and yet very important aspect of this definition of information relates to the way the information in the “words” of a code (the alphabet of symbols that is used to build signals) is distributed. Shannon realized that if this distribution was skewed, such that at any point in time, any one code word was systematically more or less likely to occur than any other – in the way the names of the babies of Beith are skewed – then the information in a system could be kept constant, such that on average, the absolute length of all of the signals communicated within it could be minimized.

The basic idea goes like this: In communication, a signal is built out of an alphabet of code words, and then transmitted between various points in a system (a sender and a receiver, or a speaker and a listener, etc.). What we need, if we want to optimize the communication in this kind of a system, is for the information in each part of the signal to have the same value at every point in the system (such that Hugh provides the same information about a given identity regardless of who is speaking). If this is our goal, then the least efficient way of organizing a code is to make all of the code words equally probable. In name terms, this would be equivalent to a situation where all names were equally distributed amongst everyone (or, even worse, where everyone had a unique name).

We can think about why this is a problem as follows: If you knew that the names John, Michael, and Paul were evenly distributed among the men in your town, and all the local men you were introduced to in the past month were called Paul, then the likelihood that the next man you would meet would be called John or Michael would have increased. And if such a scenario really obtained, then the information conveyed by John and Michael and Paul would vary wildly across the speakers of your community (the system), depending on the idiosyncratic experiences of whoever happened to be speaking or listening.

Shannon figured out how this problem could be avoided if the words in a code were distributed such that the system was effectively memoryless. And the way to make this happen is to exponentially skew the frequency with which each code word is used, such that the most frequent code word is exponentially more frequent than the second; the second most frequent code word is exponentially more frequent than the third; and so on.
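
As a rough illustration of what an exponentially skewed code looks like, the sketch below builds a truncated geometric distribution over ranked code words, in which each rank is a fixed fraction as frequent as the one before it. The ratio of 0.7, the choice of 10 code words, and the function name are all assumptions made for illustration; nothing in the Beith data fixes these values.

```python
def geometric_rank_distribution(n_words, ratio):
    """Probabilities for ranks 1..n_words, each `ratio` times the previous one."""
    if n_words < 1 or not 0 < ratio < 1:
        raise ValueError("need n_words >= 1 and 0 < ratio < 1")
    weights = [ratio ** rank for rank in range(n_words)]
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the probabilities sum to 1

# Hypothetical parameters, chosen only to show the shape of the distribution.
probs = geometric_rank_distribution(n_words=10, ratio=0.7)
for rank, p in enumerate(probs, start=1):
    print(f"rank {rank:2d}: {p:.3f}")
```

The constant ratio between neighbouring ranks is what makes this the discrete counterpart of an exponential distribution: if the top few ranks are set aside, the remaining ranks still follow the same constant-ratio pattern.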

Why is this called a memoryless distribution, and why is it useful for communication purposes? Well, if we think of a receiver as waiting for code words to “arrive” (i.e., how often they will be experienced), it turns out that if code words are distributed in this way, then even allowing for vagaries in sampling, each code word will tend to arrive at each different point in the communication system at a rate that is broadly representative of its objective frequency. And because of this, the addition of a memory that tracks the occurrence of code words would add nothing to the system. And it further follows that when codes are distributed like this, the likelihood of any given code word appearing anywhere within the system will remain roughly constant across the system. Which will mean that any message transmitted in that system will convey the same amount of information, regardless of where it’s transmitted from or where it’s received. And this therefore ensures that the “information” value of a given message is the one defined by the source code itself.

When it comes to names, Shannon’s solution stands in contrast to the hypothetical situation elaborated above, in which Johns and Pauls and Michaels are equally likely. In the optimal case that Shannon described, the same set of male first names would be given out in a very skewed way, such that a lot of men would be given one name, like John, and fewer another name, say Thomas, and still fewer some other name, like, say, William. If, in such a system, folks were talking to one another and a given male name was uttered, that name would be roughly equally surprising (or predictable) to each person who heard or said it.

All of which suggests that if we somehow get the distribution of names right – that is, if parents dole out the Johns and Williams in just the right proportions – then the amount of information conveyed by individual names will tend to be fairly stable across the community. And if, in turn, the way that our children’s brains develop maximizes their ability to learn conventions (and it turns out that children’s brains do develop in just this way) then this will serve to further reinforce this effect.

Which is to say that if we distribute names exponentially, and if we make frequent names shorter and easier to say, then not only will we achieve consistency among speakers, but everyone will (on average) spend less time talking and thinking about names (a principle Shannon proved for communication codes, and which applies equally to names).

The social evolution of names

Suppose we assume for a moment that the naming practices of Beith had evolved over generations to make communicating about names easier, and that the skewed distribution of Margarets, Jeans, and Agneses in Beithers’ naming of their daughters implicitly reflects this. Given such an assumption, we can ask: How well did their practices serve these purposes?

To answer this question, I plotted the distribution of names we saw earlier, and fitted an exponential function to each plot, so that the empirical distribution can be compared to the information theoretic ideal. The vertical axis shows the number of baptisms, and the horizontal axis the names ranked from most to least frequent.
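
One plausible way to carry out such a fit (an assumption here, not necessarily the procedure behind the plots) is an ordinary least-squares line fitted to the logarithm of the counts, which corresponds to the model count = a * exp(b * rank). The counts in the sketch below are synthetic, generated from an exact exponential purely to show the mechanics; they are not the Beith figures, and the function name is hypothetical.

```python
import math

def fit_exponential(counts):
    """Fit counts[i] ~ a * exp(b * rank), rank = 1, 2, ..., by least squares on log(counts)."""
    ranks = list(range(1, len(counts) + 1))
    logs = [math.log(c) for c in counts]           # requires every count > 0
    n = len(counts)
    mean_r = sum(ranks) / n
    mean_l = sum(logs) / n
    sxy = sum((r - mean_r) * (l - mean_l) for r, l in zip(ranks, logs))
    sxx = sum((r - mean_r) ** 2 for r in ranks)
    b = sxy / sxx                                   # slope in log space (negative for decay)
    a = math.exp(mean_l - b * mean_r)               # intercept mapped back to count space
    return a, b

# Synthetic rank-frequency counts (NOT the Beith data): 900 * exp(-0.4 * rank).
synthetic_counts = [900 * math.exp(-0.4 * rank) for rank in range(1, 11)]
a, b = fit_exponential(synthetic_counts)
print(f"a = {a:.1f}, b = {b:.2f}")
```

Fitting in log space gives the rarer names relatively more weight than a fit on the raw counts would; a nonlinear least-squares fit on the raw counts is a common alternative.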

Here’s what this looks like for the girls:


And here’s the boys:


What these plots show is that in the hundred years between 1701 and 1800, if we assume that the good citizens of Beith were choosing their children’s names so as to maximize communicative efficiency (in Shannon’s terms at least), then for both boys and girls, they managed to achieve this to quite a remarkable degree. Indeed, the fit is close to perfect.

Of course, we can be fairly confident that the citizens of Beith were not consciously trying to do this. (Not least because information theory wouldn’t be invented until some 200 years later.) Yet whatever it is they actually believed they were doing in naming their sons and daughters, these beliefs had the happy, added benefit of making the Beithers’ communication about identities remarkably efficient.

Moreover, we can further quantify these effects, so as to compare what Beithers actually did to what they might otherwise have done. If we look carefully at the distributions, we find that while 93% of boys were given one of just 10 names (John, William, Robert, James, Hugh, Thomas, Andrew, David, Alexander, and Matthew), the total male name inventory in Beith amounted to 50 (17 of which occurred just once); the distribution of female names was much the same. Estimating the information in the full distribution reveals that thought or talk involving the first names of the boys or girls of Beith would have required processing around 3.3 bits of information.

At which point I should acknowledge that ‘bits’ are not the easiest of measures to intuitively grasp!

Thankfully, there is another way of thinking about this as a unit of measurement. Earlier, I described how the least efficient way to distribute code words is to make them equally probable. This yields a nice trick that computational linguists like to use to make the idea of a ‘bit’ more comprehensible: If we raise 2 to the bit value of a complex, highly skewed distribution, then this transformation will serve to tell us what that bit value represents in terms of the number of options in a distribution where everything is equally likely.

For example, if the 3.3 bits of information in the 50 names given to the boys of Beith is transformed in this way, we get a value of 10. This value is called the “perplexity” of 3.3 bits, because it represents the perplexity one might feel at being asked to choose between 10 equally plausible (or attractive) options.

It then becomes straightforward to compare this value to what it might have been if names had been shared around equally: Obviously, if 50 boys’ names were used in Beith, then if they were all equiprobable, it follows that they would have had a perplexity value of, well, 50. Which means that by doing what they were actually doing, and not just choosing from a list of names at random, the burghers of Beith succeeded in making names 5 times more efficient. What’s more, if we compare the empirical distribution to a situation where each parent gave their child a unique name, it is over 350 times more efficient. (And, of course, given their sui generis function – and the problems people have with names – efficiency is an important consideration here.)
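
The arithmetic behind these comparisons can be checked in a few lines. The sketch below defines entropy and perplexity in the usual way and then reproduces the figures quoted above. The function names are illustrative, the 3.3-bit value is taken from the text rather than recomputed (the full name counts are not reproduced here), and the unique-name case assumes one name for each of the 3,561 baptized boys.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def perplexity(bits):
    """Number of equally likely options that would carry the same entropy."""
    return 2 ** bits

# 50 equally likely names: entropy is log2(50) bits, so perplexity is 50.
uniform_50 = [1 / 50] * 50
uniform_perplexity = perplexity(entropy_bits(uniform_50))

# Entropy estimate quoted in the text for the boys' names.
beith_perplexity = perplexity(3.3)

print(f"uniform: {uniform_perplexity:.1f}, Beith: {beith_perplexity:.1f}")
print(f"gain over equiprobable names: {uniform_perplexity / beith_perplexity:.1f}x")
# Assumption: one unique name per baptized boy, i.e. 3,561 equally likely options.
print(f"gain over unique names: {3561 / beith_perplexity:.0f}x")
```

Because 2 raised to 3.3 is slightly under 10, the ratios come out a little above the rounded figures of 5 and 350 given in the text.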

Historically, it turns out that the parish of Beith was far from unique. In every 50-year period from 1550-1599 to 1750-1799, around 50% of all boys born in England were named William, John or Thomas, and around 50% of all girls were named Elizabeth, Mary or Anne. Across a large slice of English history, it is clear that Beith’s name distribution was the rule, and not an exception. Which means that insofar as historical naming practices in Britain are concerned, this analysis of the communicative efficiency of names represents a wider statistical norm.

How confusing are John, William and Thomas?

One last detail of communication theory that is particularly relevant to understanding names was inspired by the experiences of Richard Hamming, an engineer at Bell Labs in the late 1940s. Hamming worked on the Model V computer, a punch-card-fed electromechanical machine that had a processing cycle that could be counted in seconds (as compared to the nanoseconds of a modern laptop). As Hamming fed in the cards that contained his programs, he grew frustrated with continually having to restart the process from scratch each and every time the unreliable card reader on the machine made an error. Engineer that he was, Hamming began to pay attention to the “error-correcting” bits that he had to put into the code he fed into the machine, and to think about how they might be best arranged so as to maximize their efficiency.

Hamming soon figured out that the main problem caused by an incorrect reading was that of flipping – a one would be flipped to a zero – and that the “distance” between two code words could be defined in terms of the number of bits that would have to be flipped in order to turn one code word into another (this is now called the Hamming distance, after him). While maximizing this distance would reduce the possibility of error, it also risked reducing efficiency. Hamming realized that the solution to this problem lay in devising a code that optimized the trade-off between the Hamming distance of each code word, on one hand, and the overall number of bits that needed to be added to achieve this, on the other.
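
The distance measure itself is simple to state in code. The following is a minimal sketch for equal-length code words; the two 7-bit strings are hypothetical examples chosen only for illustration.

```python
def hamming_distance(word_a, word_b):
    """Number of positions at which two equal-length code words differ."""
    if len(word_a) != len(word_b):
        raise ValueError("code words must have the same length")
    return sum(a != b for a, b in zip(word_a, word_b))

# Hypothetical 7-bit code words: they differ in the third and fifth positions.
print(hamming_distance("1011101", "1001001"))  # 2
```

The larger this number is for every pair of code words, the more bits a faulty reader would have to flip before one legal word could be mistaken for another.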

When it comes to thinking about how names work in communication, we can equate Hamming’s flipping distances to the degree of confusability between one name and another: if common names take the form of words like John, William and Thomas, and Mary, Anne and Elizabeth, which share very few acoustic similarities, then it follows that unless they are spoken by a sender with a very high degree of inaccuracy, it is going to be hard for a listener to confuse them. On the other hand, if common names took the form John, Gone and One, and Mary, Vary and Wary (whose acoustic differences are minimal), then far more mishaps would arise, simply because the difficulty of perceptually discriminating between the names would have been dramatically increased, leaving a listener far less certain about which name a speaker is producing.

Meet my friend Mary…

From an information theoretic point of view, the traditional system of English names appears to represent a formidably well-adapted communication system, in which the internal structures of its “code words” (John, William, and Thomas) are arranged to minimize flipping, and the distribution of names approximates the memoryless ideal prescribed for communication systems. In other words, this is a system that appears to be optimized to facilitate the discriminative processes that best characterize both communication and learning.

The memoryless distribution of names in Beith will have served to guarantee – at least to the extent possible – that despite any differences in their individual histories, every inhabitant of the community who had gained a certain amount of linguistic experience would have internalized a ‘name model’ in which John, William, Thomas, etc. were assigned the same information value, such that the specific uncertainty associated with any given name would be the same for each and every member of that community. And this would mean that as they spoke about their peers, the citizens of Beith would know (implicitly) that both they and their listeners would anticipate John with a fairly high degree of certainty, and Darius with a fairly low degree of certainty, such that they would (again, implicitly) know that they were free to mumble John, but that they probably ought to take more care with Darius.

Accordingly, on hearing John or Darius, or Anne or Doreen in a conversation, Beithers would begin to reduce their uncertainty about the identities that speakers were signaling. Of course, the extent of that reduction would vary: When it came to Darius or Doreen, a listener might have gleaned sufficient information to be fairly certain about the identity signaled, given the relatively sparse population of Dariuses and Doreens. By contrast, in many contexts, hearing John or Anne might have merely narrowed the set of all possible identities to the smaller (and thus relatively less uncertain) set of identities that begin with John or Anne. However, when the identity of John or Anne wasn’t immediately disambiguated by context, the system wasn’t done yet: a surname could be supplied.

And that’s not all: So far, I have described how English naming practices evolved to optimize communicative efficiency for first names. Yet remarkably, there is good reason to believe that what had in fact evolved was a system to signal identity through multiple successive name elements. Recall that if we have lots of Johns and fewer Dariuses, then the uncertainty after hearing John will be greater than that after Darius. English names appear well-equipped to handle this: If we take the most frequent first names in England (and Scotland, because the exact details vary regionally) – John, William, Thomas, Robert – the 100-million-word British National Corpus reveals that these names, but not others, are (or, at least were) highly likely to be followed by an infix: the, of, or de. And, of course, dividing up the Johns’ surnames by these infixes (as well as using no infix) will have provided plenty of scope for keeping the overall level of uncertainty associated with any given name in check.

In this system, when a surname ultimately arrived, ideally one of two things would happen:

  1. The supply of a surname would serve to eliminate a listener’s uncertainty about the identity being signaled because, taken together, the first name and surname would serve to discriminate a unique entry in the store of identities in the listener’s memory.


  2. The supply of a surname would indicate to the listener that the identity being signaled is a new one – that is, when taken together, the first name and surname would not match any existing entry in the listener’s memory. However, thanks to the design of this naming system, the listener would have some important information: they would know that they had heard a name (and not something else), and they would be able to infer that somewhere out there, there is a yet to be identified someone whose name they just encountered.

However, one further possibility also exists. One that, in the modern world at least, we are all too familiar with:

  3. The supply of a surname will not serve to eliminate a listener’s uncertainty about the identity being signaled because more than one individual is identified by that name.

Mr. Kim and Mrs. Johnson

In my last post, I described how, historically in England, Geoffrey of Wenhaston and Walpole could be the father of Geoffrey of Bramfield. And after Geoffrey One and Geoffrey Two died, the Geoffreys’ lands could pass on to Geoffrey One’s daughters, and ultimately – after the death of the sisters – to the eldest son of the youngest daughter, who would turn out to be called Geoffrey Three. I also described how modern naming practices have come to be shaped by legislation, so that in England, Robert Williamsons no longer beget Walter Robertsons who beget James Walters.

While this legislation might seem to simply codify ancient practices, history reveals a different story. The modern English name system makes it impossible for a William De Malpas to unremarkably beget a Philip Gogh (the latter means redhead) whose children then, without any fuss or ado, take the surnames Egerton and Golborne. Yet this is exactly what happened in just one line of descent of the children of William Belward, “Lord of the moiety of Malpas” in 11th century England (described in William Camden’s Remains, one of the few contemporary sources of information we have on historical English naming practices).

What is interesting about examining the early English name system from an information theoretic perspective is that it reveals something that has hitherto gone unnoticed: Looking carefully, we can see that the ‘system’ actually contains within it a number of what we now think of as distinct, alternative name systems. There are the family patronyms that we think of as standard ‘English names,’ in which Robert Malet begets Walter Malet. There are patronyms in which Robert Williamson begets Walter Robertson, a naming system native to a range of societies, from the Vikings of the North, to the Ashkenazim, the nomadic Jews of Europe, and which lives on in Iceland even today. And then there are the Geoffreys, where Geoffrey of Wenhaston and Walpole begets Geoffrey of Bramfield, and so on (interestingly, in the Beith data, in over half of the families that had sons, the father’s name reappeared as a son’s name). While this latter name system might seem odd to a modern English speaker – who might not even recognize it as a system – it might not seem so strange to a modern speaker of Chinese or Korean.

Prior to the Qin Dynasty (3rd century BC), China was largely a feudal (fengjian) society, and its naming practices were as flexible and local as those we see in twelfth century England. Surnames were localized to the ruling houses and their related lines, and largely absent among commoners. Which is to say that in China in the 3rd century BC, as in twelfth century England, surnames were foreign to the overwhelming majority of the population. As the Qin dynasty began to establish itself, it began imposing surnames on the population of China, and enumerating them for purposes of taxation, and the conscription of its citizens into forced labor, or the militia. And interestingly, unlike in England, which adopted the Robert Malet system, when the laws imposing names were codified in China, the state settled on the system exemplified by the Geoffreys.

Today, that system is still in use in China, as well as other parts of what is often called the Sinosphere, in countries such as Korea, Taiwan, and Singapore. To modern eyes, the Chinese name system might appear to be very different to that used in the USA and Europe. Indeed, the first and most striking thing that becomes apparent in a comparison between Sinosphere surnames and those used in the West is that they are first names, not last names. Thus the despotic state of North Korea has been run for over half a century by a family dynasty in which power and the presidency have passed from Kim Il-sung to his son Kim Jong-il and thence to his grandson Kim Jong-un, in exactly the same way as land once passed from Geoffrey of Wenhaston and Walpole to Geoffrey of Bramfield to Geoffrey III.

Of course, the other striking thing about Sinosphere surnames is that there are very few of them. When the colloquial expressions bǎixìng (which literally means, “hundred surnames”) and laobaixing (“old hundred surnames”) are used in Chinese, they are used to refer to the “ordinary folks” or “the people.” 85% of the Chinese population have one of these 100 common first (“family”) names. Strikingly, if we return to 18th Century Beith, we find the entire name stock of the parish for the period 1701-1800 comprised just 50 male and 62 female names: that is, a total of 112 distinct first names.

Indeed, if we were to ignore the many contrivances of states and lawmakers, and look simply at the order and frequency of names, and the information they convey, we might notice that historically, in China, England, and Korea (and much of the rest of the planet) around 50% of the population have shared just a handful of first names. We might notice that in many apparently different cultures and languages, pretty much everyone could be accounted for by some 100 first names, and that last names were far, far more diverse across the board. We might also notice that Chinese “family names” (which come first) share a lot of interesting properties with Western first names, and that Chinese given names (which come last) also share a lot of interesting properties with Western family names (which also come last). We might even start to wonder what – exactly – a family name is, anyway.

Historically, native naming systems were flexible. Surnames appear to have been geared towards providing information to discriminate between identities, rather than towards coding for heredity as “family names” now do. Yet the legislation that imposed fixed systems of names on most of the world’s population has now curtailed this flexibility. As I will discuss in my next post, it seems likely that this has done much to exacerbate – and indeed, set in stone – the problems that arise when a name fails to uniquely identify an individual (possibility #3 above).

In my last post, I described how laws initially aimed at bringing about Jewish emancipation led to the coding of ethnicity in the names imposed on Europe’s Jews, and how in time this helped facilitate the Nazi state’s attempts at Jewish extermination. I also described how names like Jamal and Lakisha hinder the attempts of African-Americans to find jobs. It would be easy to attribute this to the prejudices of the people reading those names on resumes, but, as I’ll show in my next post (completing this trilogy), the truth is more complicated. History and legislation have contrived to systematically imprison African-Americans in yet another “ghetto of names,” such that many of the causes of systematic discrimination that African-Americans experience in the job market appear to lie not in the minds of others, but in the legislative institutions that have come to shape the names we give our children.

Arai, M., & Skogman Thoursie, P. (2009). Renouncing Personal Names: An Empirical Examination of Surname Change and Earnings. Journal of Labor Economics, 27 (1), 127-147. DOI: 10.1086/593964

Brown, P. F., Pietra, V. J. D., Mercer, R. L., Pietra, S. A. D., & Lai, J. C. (1992). An estimate of an upper bound for the entropy of English. Computational Linguistics, 18 (1), 31-40.

Crook, A. (2012). Personal names in 18th-century Scotland: a case study of the parish of Beith (North Ayrshire). Scottish Place-Name News, 32, 8-9.

Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27 (3), 379-423. DOI: 10.1002/j.1538-7305.1948.tb01338.x

Shannon, C., Gallager, R., & Berlekamp, E. (1967). Lower bounds to error probability for coding on discrete memoryless channels. I. Information and Control, 10 (1), 65-103. DOI: 10.1016/S0019-9958(67)90052-6

Smith-Bannister, S. (1997). Names and naming patterns in England, 1538-1700. Oxford University Press.