Researchers Calculate Acquiring A Language Requires Learning 1.5 Megabytes Of Data, With Implications For Psychological Theory

By Emma Young

How do we acquire our native language? Are the basics of language and grammar innate, as nativists argue? Or, as empiricists propose, is language something we must learn entirely from scratch? 

This debate has a long history. To get at an answer, it’s worth setting the theories aside and instead looking at just how much information must be learned in order to speak a language with adult proficiency, argue Francis Mollica at the University of Rochester, US, and Steven Piantadosi at the University of California, Berkeley. If the amount is vast, for instance, this could indicate that it’s impracticable for it all to be learned without sophisticated innate language mechanisms. In their new paper, published in Royal Society Open Science, Mollica and Piantadosi present results suggesting that some language-specific knowledge could be innate – but probably not the kind of syntactic knowledge (the grammatical rules underlying correct word order) that nativists have tended to argue in favour of. Indeed, their work suggests that the long-running focus on whether syntax is learned or innate has been misplaced. 

Mollica and Piantadosi worked out “back-of-the-envelope” upper, lower and “best guess” estimates of how much information we have to absorb in order to acquire various aspects of language — to identify phonemes (units of sound), word-forms, and lexical semantics (the meaning of words and the relationships between them), as well as syntax, for example. 

The maths that they used to do this is complex. (If you’re fond of equations that span a page-width, you should definitely check theirs out.) But their fundamental approach is to compute the number of “bits” required to specify an outcome (learning the meaning of a word, for example) “from a plausible space of logically possible alternatives”. 
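To give a feel for what counting “bits” means here, the sketch below uses the textbook simplification that every alternative is equally likely, in which case singling out one of N alternatives takes log2(N) bits. This is only an illustration of the general idea, not the researchers’ actual calculation; the function name `bits_to_specify` and the example numbers are made up for this purpose.

```python
import math


def bits_to_specify(num_alternatives: int) -> float:
    """Bits needed to single out one outcome among equally likely alternatives.

    Assumption for illustration only: all alternatives are equally probable.
    """
    if num_alternatives < 1:
        raise ValueError("need at least one alternative")
    return math.log2(num_alternatives)


# Hypothetical example: choosing one option out of 1,024 takes 10 bits.
print(bits_to_specify(1024))
# Doubling the number of alternatives adds exactly one more bit.
print(bits_to_specify(2048) - bits_to_specify(1024))
```

The larger the space of alternatives a learner has to choose from, the more bits each piece of learned knowledge costs, which is why the choice of a “plausible space” matters so much to the final estimates.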

Using this approach, they estimate that storing essential knowledge about phonemes takes up only about 750 bits (a bit is a binary unit of information used in computing; 8 million bits is equivalent to 1 megabyte). However, a typical adult vocabulary of about 40,000 words involves perhaps about 400,000 bits of lexical knowledge. Storing information about what all these words mean is more demanding: the researchers’ best guess is somewhere in the region of 12,000,000 bits. A language-learner would also need to store about 80,000 bits of information about word frequency, they suggest. 

Next, Mollica and Piantadosi turned to syntax. “Syntax has traditionally been the battleground for debates about how much information is built-in versus learned,” they write. “In the face of massively incompatible and experimentally under-determined syntactic theories, we aim here to study the question in a way that is as independent as possible.” In fact, they estimate that we need to store only a very small amount of data about syntax — perhaps only 667 bits. According to these estimates, having innate knowledge of syntax wouldn’t be especially helpful, as acquiring it is relatively undemanding. 

Acquiring syntax may not require storing much information, but the total amount of language-related information that a proficient speaker must store is massive: around 1.5 megabytes. If correct, this would mean that up until the age of 18, a child would have to remember, on average, 1,000 to 2,000 bits of information every day. The researchers’ very lowest estimate is that reaching adult language proficiency would require a child to learn 120 bits per day.
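As a rough check on the arithmetic, the sketch below adds up the rounded best-guess figures quoted in this article and converts them to megabytes and to bits per day. It assumes that these five components are the whole total and that 18 years is 18 × 365 days; the paper itself may break things down differently, and the variable names are only illustrative.

```python
BITS_PER_MEGABYTE = 8_000_000  # conversion given in the article
DAYS_OF_LEARNING = 18 * 365    # assumption: 18 years, ignoring leap days

# Rounded best-guess figures as quoted in the article (in bits).
best_guess_bits = {
    "phonemes": 750,
    "lexical knowledge (vocabulary)": 400_000,
    "lexical semantics": 12_000_000,
    "word frequency": 80_000,
    "syntax": 667,
}

total_bits = sum(best_guess_bits.values())
total_megabytes = total_bits / BITS_PER_MEGABYTE
bits_per_day = total_bits / DAYS_OF_LEARNING
semantics_share = best_guess_bits["lexical semantics"] / total_bits

print(f"Total: {total_bits:,} bits (about {total_megabytes:.2f} MB)")
print(f"Average per day over 18 years: about {bits_per_day:.0f} bits")
print(f"Share taken up by lexical semantics: {semantics_share:.0%}")
```

On these figures the total comes to roughly 12.5 million bits, or a little over 1.5 megabytes, which works out at about 1,900 bits a day. Word meanings account for around 96 per cent of the total, while syntax is a rounding error by comparison.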

“To put our lower estimate in perspective, each day for 18 years a child must wake up and remember, perfectly and for the rest of their life, an amount of information equivalent to the information in this sequence:



Such a “remarkable feat of cognition” suggests that language acquisition is grounded in “remarkably sophisticated mechanisms for learning, memory and inference,” the pair comment. 

These are ballpark-type figures, they stress. But, even so, they argue that their estimates suggest that neither the nativist nor the empiricist approach provides a viable account of how we come to represent lexical semantics (word meanings) – which their work indicates is overwhelmingly the biggest language mountain to conquer. 

“Our results suggest that if any language-specific knowledge is innate, it is most likely for helping tackle the immense challenge of learning lexical semantics, rather than other domains with learnability problems that require orders of magnitude less information,” Mollica and Piantadosi conclude.

Humans store about 1.5 megabytes of information during language acquisition

Emma Young (@EmmaELYoung) is Staff Writer at BPS Research Digest

One thought on “Researchers Calculate Acquiring A Language Requires Learning 1.5 Megabytes Of Data, With Implications For Psychological Theory”

  1. This seems like an ass-backwards way to understand language acquisition. Children seem to learn language by attaching words to concepts and relationships that they already understand; it’s an act of labeling known things rather than learning new things.

    To take the simplest example I can think of, a baby understands “a nipple” as something it wants to latch on to, that has a smell and gives it something to swallow, which satisfies its hunger. It’s a concept without a label. This becomes associated with a specific face, smell, sound, texture and so on. It’s only later that the label “mama” is attached to the concepts and associations, and much later that the labels are vocalized. The concept set for a child is small, and it’s built one concept at a time. Then the concepts are labeled. Language seen this way is a simplification, an effort towards information efficiency, rather than bytes of information. If you must use computer-related terms, it’s more like a better coding language or a way of compressing files.

    Thinking about it in the same terms you think about learning language later in life is unhelpful. Learning a second language is difficult because you’re learning a new way of labeling a really complex set of concepts, and a way of describing the relationships and associations between these concepts, which aren’t necessarily directly interchangeable with your current set of concepts and associations. The act is hard because the concept set is so much larger and more complex, and each new label has a large set of associations and relationships that aren’t yet understood in the new labeling system.

    Obviously, this labeling information can be understood in terms of bytes, but you’re swapping full spreadsheets for file names in your working memory, so it seems odd to see it as extra effort rather than simplification.
