Statistical learning in language acquisition

Statistical learning is the ability for humans and other animals to extract statistical regularities from the world around them to learn about the environment. Although statistical learning is now thought to be a generalized learning mechanism, the phenomenon was first identified in human infant language acquisition.

The earliest evidence for these statistical learning abilities comes from a study by Jenny Saffran, Richard Aslin, and Elissa Newport, in which 8-month-old infants were presented with nonsense streams of monotone speech. Each stream was composed of four three-syllable "pseudowords" that were repeated randomly. After exposure to the speech streams for two minutes, infants reacted differently to hearing "pseudowords" as opposed to "nonwords" from the speech stream, where nonwords were composed of the same syllables that the infants had been exposed to, but in a different order. This suggests that infants are able to learn statistical relationships between syllables even with very limited exposure to a language. That is, infants learn which syllables are always paired together and which ones only occur together relatively rarely, suggesting that they are parts of two different units. This method of learning is thought to be one way that children learn which groups of syllables form individual words.

Since the initial discovery of the role of statistical learning in lexical acquisition, the same mechanism has been proposed for elements of phonological acquisition, and syntactical acquisition, as well as in non-linguistic domains. Further research has also indicated that statistical learning is likely a domain-general and even species-general learning mechanism, occurring for visual as well as auditory information, and in both primates and non-primates.

Lexical acquisition
The role of statistical learning in language acquisition has been particularly well documented in the area of lexical acquisition. One important contribution to infants' understanding of segmenting words from a continuous stream of speech is their ability to recognize statistical regularities of the speech heard in their environments. Although many factors play an important role, this specific mechanism is powerful and can operate over a short time scale.

Original findings


It is a well-established finding that, unlike written language, spoken language does not have any clear boundaries between words; spoken language is a continuous stream of sound rather than individual words with silences between them. This lack of segmentation between linguistic units presents a problem for young children learning language, who must be able to pick out individual units from the continuous speech streams that they hear. One proposed method of how children are able to solve this problem is that they are attentive to the statistical regularities of the world around them. For example, in the phrase "pretty baby", children are more likely to hear the sounds pre and ty heard together during the entirety of the lexical input around them than they are to hear the sounds ty and ba together. In an artificial grammar learning study with adult participants, Saffran, Newport, and Aslin found that participants were able to locate word boundaries based only on transitional probabilities, suggesting that adults are capable of using statistical regularities in a language-learning task. This is a robust finding that has been widely replicated.

To determine if young children have these same abilities Saffran Aslin and Newport exposed 8-month-old infants to an artificial grammar. The grammar was composed of four words, each composed of three nonsense syllables. During the experiment, infants heard a continuous speech stream of these words. The speech was presented in a monotone with no cues (such as pauses, intonation, etc.) to word boundaries other than the statistical probabilities. Within a word, the transitional probability of two syllable pairs was 1.0: in the word bidaku, for example, the probability of hearing the syllable da immediately after the syllable bi was 100%. Between words, however, the transitional probability of hearing a syllable pair was much lower: After any given word (e.g., bidaku) was presented, one of three words could follow (in this case, padoti, golabu, or tupiro), so the likelihood of hearing any given syllable after ku was only 33%.

To determine if infants were picking up on the statistical information, each infant was presented with multiple presentations of either a word from the artificial grammar or a nonword made up of the same syllables but presented in a random order. Infants who were presented with nonwords during the test phase listened significantly longer to these words than infants who were presented with words from the artificial grammar, showing a novelty preference for these new nonwords. However, the implementation of the test could also be due to infants learning serial-order information and not to actually learning transitional probabilities between words. That is, at test, infants heard strings such as dapiku and tilado that were never presented during learning; they could simply have learned that the syllable ku never followed the syllable pi.

To look more closely at this issue, Saffran Aslin and Newport conducted another study in which infants underwent the same training with the artificial grammar but then were presented with either words or part-words rather than words or nonwords. The part-words were syllable sequences composed of the last syllable from one word and the first two syllables from another (such as kupado). Because the part-words had been heard during the time when children were listening to the artificial grammar, preferential listening to these part-words would indicate that children were learning not only serial-order information, but also the statistical likelihood of hearing particular syllable sequences. Again, infants showed greater listening times to the novel (part-) words, indicating that 8-month-old infants were able to extract these statistical regularities from a continuous speech stream.

Further research
This result has been the impetus for much more research on the role of statistical learning in lexical acquisition and other areas (see ). In a follow-up to the original report, Aslin, Saffran, and Newport found that even when words and part words occurred equally often in the speech stream, but with different transitional probabilities between syllables of words and part words, infants were still able to detect the statistical regularities and still preferred to listen to the novel part-words over the familiarized words. This finding provides stronger evidence that infants are able to pick up transitional probabilities from the speech they hear, rather than just being aware of frequencies of individual syllable sequences.

Another follow-up study examined the extent to which the statistical information learned during this type of artificial grammar learning feeds into knowledge that infants may already have about their native language. Infants preferred to listen to words over part-words, whereas there was no significant difference in the nonsense frame condition. This finding suggests that even pre-linguistic infants are able to integrate the statistical cues they learn in a laboratory into their previously acquired knowledge of a language. In other words, once infants have acquired some linguistic knowledge, they incorporate newly acquired information into that previously acquired learning.

A related finding indicates that slightly older infants can acquire both lexical and grammatical regularities from a single set of input, suggesting that they are able to use outputs of one type of statistical learning (cues that lead to the discovery of word boundaries) as input to a second type (cues that lead to the discovery of syntactical regularities. At test, 12-month-olds preferred to listen to sentences that had the same grammatical structure as the artificial language they had been tested on rather than sentences that had a different (ungrammatical) structure. Because learning grammatical regularities requires infants to be able to determine boundaries between individual words, this indicates that infants who are still quite young are able to acquire multiple levels of language knowledge (both lexical and syntactical) simultaneously, indicating that statistical learning is a powerful mechanism at play in language learning.

Despite the large role that statistical learning appears to play in lexical acquisition, it is likely not the only mechanism by which infants learn to segment words. Statistical learning studies are generally conducted with artificial grammars that have no cues to word boundary information other than transitional probabilities between words. Real speech, though, has many different types of cues to word boundaries, including prosodic and phonotactic information.

Together, the findings from these studies of statistical learning in language acquisition indicate that statistical properties of the language are a strong cue in helping infants learn their first language.

Phonological acquisition
There is much evidence that statistical learning is an important component of both discovering which phonemes are important for a given language and which contrasts within phonemes are important. Having this knowledge is important for aspects of both speech perception and speech production.

Distributional learning
Since the discovery of infants' statistical learning abilities in word learning, the same general mechanism has also been studied in other facets of language learning. For example, it is well-established that infants can discriminate between phonemes of many different languages but eventually become unable to discriminate between phonemes that do not appear in their native language; however, it was not clear how this decrease in discriminatory ability came about. Maye et al. suggested that the mechanism responsible might be a statistical learning mechanism in which infants track the distributional regularities of the sounds in their native language. To test this idea, Maye et al. exposed 6- and 8-month-old infants to a continuum of speech sounds that varied on the degree to which they were voiced. The distribution that the infants heard was either bimodal, with sounds from both ends of the voicing continuum heard most often, or unimodal, with sounds from the middle of the distribution heard most often. The results indicated that infants from both age groups were sensitive to the distribution of phonemes. At test, infants heard either non-alternating (repeated exemplars of tokens 3 or 6 from an 8-token continuum) or alternating (exemplars of tokens 1 and 8) exposures to specific phonemes on the continuum. Infants exposed to the bimodal distribution listened longer to the alternating trials than the non-alternating trials while there was no difference in listening times for infants exposed to the unimodal distribution. This finding indicates that infants exposed the bimodal distribution were better able to discriminate sounds from the two ends of the distribution than were infants in the unimodal condition, regardless of age. This type of statistical learning differs from that used in lexical acquisition, as it requires infants to track frequencies rather than transitional probabilities, and has been named "distributional learning".

Distributional learning has also been found to help infants contrast two phonemes that they initially have difficulty in discriminating between. Maye, Weiss, and Aslin found that infants who were exposed to a bimodal distribution of a non-native contrast that was initially difficult to discriminate were better able to discriminate the contrast than infants exposed to a unimodal distribution of the same contrast. Maye et al. also found that infants were able to abstract features of a contrast (i.e., voicing onset time) and generalize that feature to the same type of contrast at a different place of articulation, a finding that has not been found in adults.

In a review of the role of distributional learning on phonological acquisition, Werker et al. note that distributional learning cannot be the only mechanism by which phonetic categories are acquired. However, it does seem clear that this type of statistical learning mechanism can play a role in this skill, although research is ongoing.

Perceptual magnet effect
A related finding regarding statistical cues to phonological acquisition is a phenomenon known as the perceptual magnet effect. In this effect, a prototypical phoneme of a person's native language acts as a "magnet" for similar phonemes, which are perceived as belonging to the same category as the prototypical phoneme. In the original test of this effect, adult participants were asked to indicate if a given exemplar of a particular phoneme differed from a referent phoneme. If the referent phoneme is a non-prototypical phoneme for that language, both adults and 6-month-old infants show less generalization to other sounds than they do for prototypical phonemes, even if the subjective distance between the sounds is the same. That is, adults and infants are both more likely to notice that a particular phoneme differs from the referent phoneme if that referent phoneme is a non-prototypical exemplar than if it is a prototypical exemplar. The prototypes themselves are apparently discovered through a distributional learning process, in which infants are sensitive to the frequencies with which certain sounds occur and treat those that occur most often as the prototypical phonemes of their language.

Syntactical acquisition
A statistical learning device has also been proposed as a component of syntactical acquisition for young children. Early evidence for this mechanism came largely from studies of computer modeling or analyses of natural language corpora. These early studies focused largely on distributional information specifically rather than statistical learning mechanisms generally. Specifically, in these early papers it was proposed that children created templates of possible sentence structures involving unnamed categories of word types (i.e., nouns or verbs, although children would not put these labels on their categories). Children were thought to learn which words belonged to the same categories by tracking the similar contexts in which words of the same category appeared.

Later studies expanded these results by looking at the actual behavior of children or adults who had been exposed to artificial grammars. These later studies also considered the role of statistical learning more broadly than the earlier studies, placing their results in the context of the statistical learning mechanisms thought to be involved with other aspects of language learning, such as lexical acquisition.

Experimental results
Evidence from a series of four experiments conducted by Gomez and Gerken suggests that children are able to generalize grammatical structures with less than two minutes of exposure to an artificial grammar. In the first experiment, 11-12 month-old infants were trained on an artificial grammar composed of nonsense words with a set grammatical structure. At test, infants heard both novel grammatical and ungrammatical sentences. Infants oriented longer toward the grammatical sentences, in line with previous research that suggests that infants generally orient for a longer amount of time to natural instances of language rather than altered instances of language e.g.,. (This familiarity preference differs from the novelty preference generally found in word-learning studies, due to the differences between lexical acquisition and syntactical acquisition.) This finding indicates that young children are sensitive to the grammatical structure of language even after minimal exposure. Gomez and Gerken also found that this sensitivity is evident when ungrammatical transitions are located in the middle of the sentence (unlike in the first experiment, in which all the errors occurred at the beginning and end of the sentences), that the results could not be due to an innate preference for the grammatical sentences caused by something other than grammar, and that children are able to generalize the grammatical rules to new vocabulary.

Together these studies suggest that infants are able to extract a substantial amount of syntactic knowledge even from limited exposure to a language. Children apparently detected grammatical anomalies whether the grammatical violation in the test sentences occurred at the end or in the middle of the sentence. Additionally, even when the individual words of the grammar were changed, infants were still able to discriminate between grammatical and ungrammatical strings during the test phase. This generalization indicates that infants were not learning vocabulary-specific grammatical structures, but abstracting the general rules of that grammar and applying those rules to novel vocabulary. Furthermore, in all four experiments, the test of grammatical structures occurred five minutes after the initial exposure to the artificial grammar had ended, suggesting that the infants were able to maintain the grammatical abstractions they had learned even after a short delay.

In a similar study, Saffran found that adults and older children (first- and second grade children) were also sensitive to syntactical information after exposure to an artificial language which had no cues to phrase structure other than the statistical regularities that were present. Both adults and children were able to pick out sentences that were ungrammatical at a rate greater than chance, even under an "incidental" exposure condition in which participants' primary goal was to complete a different task while hearing the language.

Although the number of studies dealing with statistical learning of syntactical information is limited, the available evidence does indicate that the statistical learning mechanisms are likely a contributing factor to children's ability to learn their language.

Statistical learning in bilingualism
Much of the early work using statistical learning paradigms focused on the ability for children or adults to learn a single language, consistent with the process of language acquisition for monolingual speakers or learners. However, it is estimated that approximately 60-75% of people in the world are bilingual. More recently, researchers have begun looking at the role of statistical learning for those who speak more than one language. Although there are no reviews on this topic yet, Weiss, Gerfen, and Mitchel examined how hearing input from multiple artificial languages simultaneously can affect the ability to learn either or both languages. Over four experiments, Weiss et al. found that, after exposure to two artificial languages, adult learners are capable of determining word boundaries in both languages when each language is spoken by a different speaker. However, when the two languages were spoken by the same speaker, participants were able learn both languages only when they were "congruent"—when the word boundaries of one language matched the word boundaries of the other. When the languages were incongruent—a syllable that appeared in the middle of a word in one language appeared at the end of the word in the other language—and spoken by a single speaker, participants were able to learn, at best, one of the two languages. A final experiment showed that the inability to learn incongruent languages spoken in the same voice was not due to syllable overlap between the languages but due to differing word boundaries.

Similar work replicates the finding that learners are able to learn two sets of statistical representations when an additional cue is present (two different male voices in this case). In their paradigm, the two languages were presented consecutively, rather than interleaved as in Weiss et al.'s paradigm, and participants did learn the first artificial language to which they had been exposed better than the second, although participants' performance was above chance for both languages.

While statistical learning improves and strengthens multilingualism, it appears that the inverse is not true. In a study by Yim and Rudoy it was found that both monolingual and bilingual children perform statistical learning tasks equally well.

Antovich and Graf Estes found that 14-month-old bilingual children are better than monolinguals at segmenting two different artificial languages using transitional probability cues. They suggest that a bilingual environment in early childhood trains children to rely on statistical regularities to segment the speech flow and access two lexical systems.

Word-referent mapping
A statistical learning mechanism has also been proposed for learning the meaning of words. Specifically, Yu and Smith conducted a pair of studies in which adults were exposed to pictures of objects and heard nonsense words. Each nonsense word was paired with a particular object. There were 18 total word-referent pairs, and each participant was presented with either 2, 3, or 4 objects at a time, depending on the condition, and heard the nonsense word associated with one of those objects. Each word-referent pair was presented 6 times over the course of the training trials; after the completion of the training trials, participants completed a forced-alternative test in which they were asked to choose the correct referent that matched a nonsense word they were given. Participants were able to choose the correct item more often than would happen by chance, indicating, according to the authors, that they were using statistical learning mechanisms to track co-occurrence probabilities across training trials.

An alternative hypothesis is that learners in this type of task may be using a "propose-but-verify" mechanism rather than a statistical learning mechanism. Medina et al. and Trueswell et al. argue that, because Yu and Smith only tracked knowledge at the end of the training, rather than tracking knowledge on a trial-by-trial basis, it is impossible to know if participants were truly updating statistical probabilities of co-occurrence (and therefore maintaining multiple hypotheses simultaneously), or if, instead, they were forming a single hypothesis and checking it on the next trial. For example, if a participant is presented with a picture of a dog and a picture of a shoe, and hears the nonsense word vash she might hypothesize that vash refers to the dog. On a future trial, she may see a picture of a shoe and a picture of a door and again hear the word vash. If statistical learning is the mechanism by which word-referent mappings are learned, then the participant would be more likely to select the picture of the shoe than the door, as shoe would have appeared in conjunction with the word vash 100% of the time. However, if participants are simply forming a single hypothesis, they may fail to remember the context of the previous presentation of vash (especially if, as in the experimental conditions, there are multiple trials with other words in between the two presentations of vash) and therefore be at chance in this second trial. According to this proposed mechanism of word learning, if the participant had correctly guessed that vash referred to the shoe in the first trial, her hypothesis would be confirmed in the subsequent trial.

To distinguish between these two possibilities, Trueswell et al. conducted a series of experiments similar to those conducted by Yu and Smith except that participants were asked to indicate their choice of the word-referent mapping on each trial, and only a single object name was presented on each trial (with varying numbers of objects). Participants would therefore have been at chance when they are forced to make a choice in their first trial. The results from the subsequent trials indicate that participants were not using a statistical learning mechanism in these experiments, but instead were using a propose-and-verify mechanism, holding only one potential hypothesis in mind at a time. Specifically, if participants had chosen an incorrect word-referent mapping in an initial presentation of a nonsense word (from a display of five possible choices), their likelihood of choosing the correct word-referent mapping in the next trial of that word was still at chance, or 20%. If, though, the participant had chosen the correct word-referent mapping on an initial presentation of a nonsense word, the likelihood of choosing the correct word-referent mapping on the subsequent presentation of that word was approximately 50%. These results were also replicated in a condition where participants were choosing between only two alternatives. These results suggest that participants did not remember the surrounding context of individual presentations and were therefore not using statistical cues to determine the word-referent mappings. Instead, participants make a hypothesis regarding a word-referent mapping and, on the next presentation of that word, either confirm or reject the hypothesis accordingly.

Overall, these results, along with similar results from Medina et al., indicate that word meanings may not be learned through a statistical learning mechanism in these experiments, which ask participants to hypothesize a mapping even on the first occurrence (i.e., not cross-situationally). However, when the propose-but-verify mechanism has been compared to a statistical learning mechanism, the former was unable to reproduce individual learning trajectories nor fit as well as the latter.

Need for social interaction
Additionally, statistical learning by itself cannot account even for those aspects of language acquisition for which it has been shown to play a large role. For example, Kuhl, Tsao, and Liu found that young English-learning infants who spent time in a laboratory session with a native Mandarin speaker were able to distinguish between phonemes that occur in Mandarin but not in English, unlike infants who were in a control condition. Infants in this control condition came to the lab as often as infants in the experimental condition, but were exposed only to English; when tested at a later date, they were unable to distinguish the Mandarin phonemes. In a second experiment, the authors presented infants with audio or audiovisual recordings of Mandarin speakers and tested the infants' ability to distinguish between the Mandarin phonemes. In this condition, infants failed to distinguish the foreign language phonemes. This finding indicates that social interaction is a necessary component of language learning and that, even if infants are presented with the raw data of hearing a language, they are unable to take advantage of the statistical cues present in that data if they are not also experiencing the social interaction.

Domain generality
Although the phenomenon of statistical learning was first discovered in the context of language acquisition and there is much evidence of its role in that purpose, work since the original discovery has suggested that statistical learning may be a domain general skill and is likely not unique to humans. For example, Saffran, Johnson, Aslin, and Newport found that both adults and infants were able to learn statistical probabilities of "words" created by playing different musical tones (i.e., participants heard the musical notes D, E, and F presented together during training and were able to recognize those notes as a unit at test as compared to three notes that had not been presented together). In non-auditory domains, there is evidence that humans are able to learn statistical visual information whether that information is presented across space, e.g., or time, e.g.,. Evidence of statistical learning has also been found in other primates, e.g., and some limited statistical learning abilities have been found even in non-primates like rats. Together these findings suggest that statistical learning may be a generalized learning mechanism that happens to be utilized in language acquisition, rather than a mechanism that is unique to the human infant's ability to learn his or her language(s).

Further evidence for domain general statistical learning was suggested in a study run through the University of Cornell Department of Psychology concerning visual statistical learning in infancy. Researchers in this study questioned whether domain generality of statistical learning in infancy would be seen using visual information. After first viewing images in statistically predictable patterns, infants were then exposed to the same familiar patterns in addition to novel sequences of the same identical stimulus components. Interest in the visuals was measured by the amount of time the child looked at the stimuli in which the researchers named "looking time". All ages of infant participants showed more interest in the novel sequence relative to the familiar sequence. In demonstrating a preference for the novel sequences (which violated the transitional probability that defined the grouping of the original stimuli) the results of the study support the likelihood of domain general statistical learning in infancy.