Research Digest

Researchers Calculate Acquiring A Language Requires Learning 1.5 Megabytes Of Data, With Implications For Psychological Theory

By Emma Young

How do we acquire our native language? Are the basics of language and grammar innate, as nativists argue? Or, as empiricists propose, is language something we must learn entirely from scratch? 

This debate has a long history. To get at an answer, it’s worth setting the theories aside and instead looking at just how much information must be learned in order to speak a language with adult proficiency, argue Francis Mollica at the University of Rochester, US, and Steven Piantadosi at the University of California, Berkeley. If the amount is vast, for instance, this could indicate that it’s impracticable for it all to be learned without sophisticated innate language mechanisms. In their new paper, published in Royal Society Open Science, Mollica and Piantadosi present results suggesting that some language-specific knowledge could be innate – but probably not the kind of syntactic knowledge (the grammatical rules underlying correct word order) that nativists have tended to argue in favour of. Indeed, their work suggests that the long-running focus on whether syntax is learned or innate has been misplaced. 

Mollica and Piantadosi worked out “back-of-the-envelope” upper, lower and “best guess” estimates of how much information we have to absorb in order to acquire various aspects of language: phonemes (units of sound), word-forms and lexical semantics (the meanings of words and the relationships between them), as well as syntax.

The maths that they used to do this is complex. (If you’re fond of equations that span a page-width, you should definitely check theirs out.) But their fundamental approach is to compute the number of “bits” required to specify an outcome (learning the meaning of a word, for example) “from a plausible space of logically possible alternatives”.
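As a minimal illustration of that idea, and not the authors' actual calculation (which is considerably more involved): if every alternative is assumed to be equally likely, singling out one of N alternatives takes log2(N) bits. The function name and the example counts below are hypothetical, chosen only to show the principle.

```python
import math


def bits_to_specify(num_alternatives: int) -> float:
    """Bits needed to single out one of `num_alternatives` options.

    Assumption (not from the paper): all options are equally likely.
    """
    if num_alternatives < 1:
        raise ValueError("need at least one alternative")
    return math.log2(num_alternatives)


print(bits_to_specify(2))     # a yes/no choice takes 1.0 bit
print(bits_to_specify(8))     # 3.0
print(bits_to_specify(1024))  # 10.0
```

The larger the space of possible alternatives, the more bits a learner has to store to pin down the right one.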

Using this approach, they estimate that storing essential knowledge about phonemes takes up only about 750 bits (a bit is a binary unit of information used in computing; 8 million bits is equivalent to 1 megabyte). A typical adult vocabulary of about 40,000 words, by comparison, involves about 400,000 bits of lexical knowledge. Storing information about what all these words mean is more demanding still: the researchers’ best guess is somewhere in the region of 12,000,000 bits. A language-learner would also need to store about 80,000 bits of information about word frequency, they suggest.

Next, Mollica and Piantadosi turned to syntax. “Syntax has traditionally been the battleground for debates about how much information is built-in versus learned,” they write. “In the face of massively incompatible and experimentally under-determined syntactic theories, we aim here to study the question in a way that is as independent as possible.” In fact, they estimate that we need to store only a very small amount of data about syntax — perhaps only 667 bits. According to these estimates, having innate knowledge of syntax wouldn’t be especially helpful, as acquiring it is relatively undemanding. 

Syntactic knowledge may not take up much storage, but the total amount of language-related information that must be stored by a proficient language speaker is massive: around 1.5 megabytes. If correct, this would mean that up until the age of 18, a child would have to remember, on average, 1,000 to 2,000 bits of information every day. The researchers’ very lowest estimate is that reaching adult language proficiency would require that a child learn 120 bits per day.
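The arithmetic behind these totals can be checked with a short sketch. It uses only the best-guess figures quoted in this article, the conversion of 8 million bits to 1 megabyte given above, and the simplifying assumption of 365 days per year for 18 years (leap days ignored). The variable names are illustrative.

```python
BITS_PER_MEGABYTE = 8_000_000   # conversion quoted in the article
DAYS_TO_ADULTHOOD = 18 * 365    # assumption: leap days ignored

# Best-guess estimates as quoted in this article, in bits.
best_guess_bits = {
    "phonemes": 750,
    "lexical knowledge (word-forms)": 400_000,
    "lexical semantics": 12_000_000,
    "word frequency": 80_000,
    "syntax": 667,
}

total_bits = sum(best_guess_bits.values())
total_megabytes = total_bits / BITS_PER_MEGABYTE
bits_per_day = total_bits / DAYS_TO_ADULTHOOD
semantics_share = best_guess_bits["lexical semantics"] / total_bits

print(f"total: {total_bits:,} bits, about {total_megabytes:.2f} MB")
print(f"average load: about {bits_per_day:,.0f} bits per day")
print(f"lexical semantics share: {semantics_share:.0%}")
```

Summed this way, the quoted figures come to roughly 1.56 megabytes, or about 1,900 bits per day, with lexical semantics accounting for around 96 per cent of the total.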

“To put our lower estimate in perspective, each day for 18 years a child must wake up and remember, perfectly and for the rest of their life, an amount of information equivalent to the information in this sequence:

011010000110100101100100011001000110010101101110011000010110001101100011011011110111001001100100011010010110111101101110”
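A quick way to confirm that the quoted sequence really is 120 bits long is to count its digits. The sketch below also decodes it under the assumption, not stated in the article, that each group of 8 bits is one ASCII character.

```python
# The sequence quoted above, joined into a single string.
quoted_bits = (
    "011010000110100101100100011001000110010101101110011000010110001101100011"
    "011011110111001001100100011010010110111101101110"
)

num_bits = len(quoted_bits)

# Assumption: 8 bits per character, standard ASCII codes.
decoded = "".join(
    chr(int(quoted_bits[i:i + 8], 2)) for i in range(0, num_bits, 8)
)

print(num_bits)  # 120
print(decoded)   # hiddenaccordion
```

Read this way, the daily minimum corresponds to memorising a 15-letter string, every day, for 18 years.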

Such a “remarkable feat of cognition” suggests that language acquisition is grounded in “remarkably sophisticated mechanisms for learning, memory and inference,” the pair comment. 

These are ballpark-type figures, they stress. But, even so, they argue that their estimates suggest that neither the nativist nor the empiricist approach provides a viable account of how we come to represent lexical semantics (word meanings) – which their work indicates is overwhelmingly the biggest language mountain to conquer. 

“Our results suggest that if any language-specific knowledge is innate, it is most likely for helping tackle the immense challenge of learning lexical semantics, rather than other domains with learnability problems that require orders of magnitude less information,” Mollica and Piantadosi conclude.

Humans store about 1.5 megabytes of information during language acquisition

Emma Young (@EmmaELYoung) is Staff Writer at BPS Research Digest