User:Sophie Pellerin/sandbox

Speech Perception

Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and to cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.

In most cases, speech perception is an audiovisual process. The process of perceiving speech begins at the level of the sound signal and with the process of audition. (For a complete description of the process of audition, see Hearing.) When speech is presented in an audiovisual modality, listeners receive visual speech information (e.g., from a speaker’s lip movements, facial expressions, and emotional expressions) and auditory speech information (speech sounds). A model of audiovisual speech perception proposed by Grant and colleagues (1998) suggests that after the initial auditory and visual signals are processed, the auditory signal is further processed to extract acoustic cues and phonemic information (information about speech sounds), and the visual signal is further processed to extract visemes (what a speaker’s articulators look like during the production of speech sounds). These two sources of information are then combined, and listeners apply processes that help determine the meaning and structure of the information in the speech signal. Contextual information (which helps generate expectations about the words that will follow) supports the application of these processes. The speech information provided by the speech signal can then be used for higher-level language processes, such as word recognition. Listeners’ language and memory abilities are involved throughout audiovisual speech perception.
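The staged description above can be pictured as a simple processing pipeline. The sketch below is purely illustrative and assumes hypothetical data structures and a toy integration rule; only the stage names (phonemes, visemes, context, word recognition) come from the description of the Grant et al. (1998) model.

```python
from dataclasses import dataclass, field

# Illustrative schematic only: the stage names follow the description above,
# but the data structures, values, and integration rule are hypothetical
# placeholders, not part of the published model.

@dataclass
class AudiovisualInput:
    phonemes: list = field(default_factory=list)  # candidates extracted from the acoustic signal
    visemes: list = field(default_factory=list)   # articulator shapes extracted from the visual signal
    context: list = field(default_factory=list)   # expected words generated from prior content

def integrate(signal: AudiovisualInput) -> tuple:
    """Combine auditory and visual evidence, then let context support word recognition."""
    # In this toy version, a viseme fills in wherever the phoneme is ambiguous ("?").
    combined = [p if p != "?" else v for p, v in zip(signal.phonemes, signal.visemes)]
    candidate = "".join(combined)
    recognized = candidate in signal.context      # higher-level process: word recognition
    return candidate, recognized

# Noise makes the vowel ambiguous in the auditory stream; the visual stream
# and contextual expectations still allow the word to be recognized.
example = AudiovisualInput(phonemes=["b", "?", "t"], visemes=["b", "a", "t"],
                           context=["bat", "ball"])
print(integrate(example))  # -> ('bat', True)
```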

Cross-Language and Second-Language Speech Perception

A large amount of research has studied how users of a language perceive foreign speech (referred to as cross-language speech perception) or second-language speech (second-language speech perception). The latter falls within the domain of second language acquisition. Languages differ in their phonemic (sound) inventories. For this reason, the phonemic knowledge of a non-native speaker is generally less developed than that of a native speaker. Naturally, this creates difficulties when a foreign or non-native language is encountered. For example, if two foreign or non-native language sounds are assimilated to a single mother-tongue category, the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English have problems identifying or distinguishing the English liquid consonants /l/ and /r/ (see Perception of English /r/ and /l/ by Japanese speakers).

Seeing a speaker’s lip movements helps non-native listeners perceive phonemic contrasts (differences between two sounds) that are not present in their native language. This occurs because the combination of auditory and visual speech information results in an enhanced (and thus easier to understand) speech signal. Navarra and Soto-Faraco (2007) studied the perception of the /ɛ/–/e/ contrast in Spanish-Catalan bilinguals. They found that when stimuli were presented in the auditory modality only, Spanish-dominant bilinguals (Spanish lacks the /ɛ/–/e/ phonemic contrast) did not distinguish between /ɛ/ and /e/, whereas Catalan-dominant bilinguals (Catalan has this contrast) distinguished the two sounds. However, when stimuli containing /ɛ/ and /e/ were presented with both auditory and visual speech information, both bilingual groups perceived the phonemic contrast.

Best (1995) proposed the Perceptual Assimilation Model to explain how listeners classify second-language (L2) phonemic contrasts. The model suggests that listeners classify the phonemic contrasts of their L2 according to how similar or dissimilar the sounds in these contrasts are to other sounds in their L1 and L2. These classifications explain how L2 phonemic contrasts are incorporated into an individual’s L1 phonemic categories. The model outlines three possible patterns of incorporation. In the “Two-Category” pattern, the two sounds of an L2 phonemic contrast are incorporated into two different L1 phonemic categories. In the “Category Goodness” pattern, both sounds of an L2 phonemic contrast are incorporated into the same L1 phonemic category, but one of the L2 sounds is more similar to the sounds in that L1 category than the other. In the “Single Category” pattern, both sounds of an L2 phonemic contrast are incorporated into a single L1 phonemic category and both sounds are equally different from the sounds in that L1 category.

Morphology, lexicon, semantics, and syntax also differ between languages. As is the case for phonemes, non-native listeners generally have a less developed knowledge of these aspects of their non-native language than native listeners, which makes speech perception more difficult in an L2 than in an L1.
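The three assimilation patterns described above can be summarized in a short classification sketch. This is a minimal illustration, not an implementation of Best’s model: the goodness-of-fit ratings, the tolerance threshold, and the example values are all hypothetical.

```python
from enum import Enum

class Assimilation(Enum):
    TWO_CATEGORY = "two-category"
    CATEGORY_GOODNESS = "category-goodness"
    SINGLE_CATEGORY = "single-category"

def classify_l2_contrast(l1_category_a, l1_category_b,
                         goodness_a, goodness_b, tolerance=0.1):
    """Classify how an L2 phonemic contrast is assimilated into L1 categories.

    goodness_a / goodness_b are hypothetical 0-1 goodness-of-fit ratings and
    tolerance is an illustrative threshold; neither comes from Best (1995).
    """
    if l1_category_a != l1_category_b:
        # Each L2 sound maps onto a different L1 category.
        return Assimilation.TWO_CATEGORY
    if abs(goodness_a - goodness_b) > tolerance:
        # Same L1 category, but one L2 sound fits it noticeably better.
        return Assimilation.CATEGORY_GOODNESS
    # Same L1 category and roughly equal fit: the hardest contrast to discern.
    return Assimilation.SINGLE_CATEGORY

# Example: English /r/ and /l/ both assimilated to the single Japanese liquid
# with similar goodness of fit, i.e. a single-category assimilation.
print(classify_l2_contrast("Japanese liquid", "Japanese liquid", 0.55, 0.50))
```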
How difficult speech perception is for non-native listeners depends on the age at which they learned the non-native language, on the amount of exposure they have had to the language, and on their proficiency in the language. Generally, individuals who acquire a non-native language earlier in life (and especially during the critical period for second language acquisition) are more likely to achieve native-like proficiency in that language. Evidence also suggests that early bilinguals perceive speech in their L2 more accurately than late bilinguals, especially when speech is presented with background noise.

Noise and Masking

One of the fundamental problems in the study of speech perception is how to deal with noise. Multiple types of noise and maskers can make speech perception more difficult for listeners. Perceiving speech presented with background noise and/or masking requires separating the target speech signal from the masking elements and/or the background noise. Masking of the speech signal by background noise from which no meaningful information can be picked up, such as the babble of multiple individuals talking at once, is known as energetic masking. Energetic masking makes unavailable the auditory speech cues that would help with the identification of speech segments and their boundaries. This type of masking also makes prosodic cues (which can also facilitate speech perception) less accessible to listeners. Informational masking can also hinder listeners’ ability to perceive speech accurately. This type of masking occurs when the meaningful information contained in a competing signal or background noise can be clearly understood by listeners and interferes with the understanding of the information in the target speech signal. Background noise consisting of a single individual speaking in a listener’s native language results in this type of masking.

Devices used to transmit and perceive speech, such as telephones, also degrade the quality of the speech signal. In fact, frequencies below 400 Hz and above 3,400 Hz are typically filtered out by telephone transmission.[11] Since most human speech is produced at frequencies between 100 Hz and 5,000 Hz, some of the frequencies at which human speech is produced are filtered out by telephone transmission and therefore cannot be perceived by listeners. This can lead to less accurate perception of speech produced at frequencies in the filtered-out range.

Reverberation can also reduce the intelligibility of a speech signal. It occurs when a sound reaches a listener indirectly, after being reflected off multiple surfaces. When reverberation occurs, the sound reaching a listener carries additional energy (reverberant energy), which degrades portions of the speech signal. Reverberation masks speech segments (individual sounds or words) and their boundaries as well as prosodic information, and degrades consonants to a greater extent than vowels.

Speech perception in noise can be facilitated by visual speech cues and semantic context.[11][12] When phonemic information is masked by noise, listeners can use the visemes provided by the speaker’s lip movements to disambiguate the auditory information they get from the speech signal. Visemes can complement the phonemic information listeners receive from the speech signal because the information provided by visemes (e.g., the place of articulation of a speech sound) is not necessarily perceivable through the phonemic information alone. When semantic contextual information is included in speech and can be perceived at least to some extent, listeners can generate expectations about the information that will follow based on the meaning of the previous content. This helps listeners generate hypotheses about what is being said when they cannot understand all of the information in the speech signal.
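The telephone-band filtering described above can be illustrated with a short signal-processing sketch. The code below is only a rough illustration: it assumes a simple Butterworth band-pass filter as a stand-in for a real telephone channel, uses the cutoff frequencies quoted in the text, and applies it to a purely synthetic signal rather than real speech.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def telephone_band(signal, fs, low_hz=400.0, high_hz=3400.0):
    """Approximate a narrowband telephone channel with a band-pass filter.

    Energy below low_hz and above high_hz is strongly attenuated, so any
    speech information carried at those frequencies is lost to the listener.
    """
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, signal)

# Synthetic "speech-like" signal with components at 150 Hz (near a typical
# fundamental frequency), 1 kHz, and 4.5 kHz.  After filtering, only the
# 1 kHz component passes largely unchanged; the other two are attenuated.
fs = 16_000                              # sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
speech_like = (np.sin(2 * np.pi * 150 * t)
               + np.sin(2 * np.pi * 1000 * t)
               + np.sin(2 * np.pi * 4500 * t))
filtered = telephone_band(speech_like, fs)
```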
Visual speech cues can also alter speech perception, as happens with the McGurk effect. This effect occurs when listeners are presented with incongruent visual and auditory speech cues (e.g., /ga/ is presented visually while /ba/ is presented in the auditory modality).[12] This type of presentation results in listeners perceiving a sound that is a fusion of the stimuli presented in the auditory and visual modalities (the combination of a visual /ga/ and an auditory /ba/ generally results in listeners perceiving /da/) (see McGurk effect).

The difficulties that can accompany speech perception in noise are illustrated by how difficult it is for computer recognition systems to recognize human speech. These systems can do well at recognizing speech if trained on a specific speaker's voice and under quiet conditions, but they often do poorly in more realistic listening situations where humans would understand speech relatively easily. Prior knowledge is a key neural factor in emulating the processing patterns the brain maintains under normal conditions, since a robust learning history can, to an extent, override the extreme masking effects that accompany the complete absence of continuous speech signals.

The Role of Working Memory

When engaging in speech perception, listeners must hold and manipulate the auditory and visual information provided by the speech signal in their working memory. Because working memory capacity is limited, an individual’s ability to hold and manipulate speech information in working memory is also limited. This limits the amount of information from the speech signal they can use to perceive speech and, ultimately, how accurately they perceive it. When speech perception occurs in noisy conditions, the demands placed on working memory increase.

Rönnberg, Rudner, Foo, and Lunner (2008) proposed the Ease of Language Understanding (ELU) model to explain this phenomenon. The model is based on the premise that when speech is presented in multiple modalities (e.g., auditory and visual) under optimal listening conditions (no background noise or masking, in a native language, and with perfect articulation), phonology, semantics, syntax, and prosody from the speech signal are combined rapidly and automatically to form a stream of phonological information (called RAMBPHO). In these optimal conditions, the information provided by the speech signal is easily matched with the phonological representations stored in a listener’s long-term memory. However, under sub-optimal listening conditions (e.g., in the presence of background noise), there is an increased probability of a mismatch between RAMBPHO and the phonological representations stored in long-term memory. To resolve this mismatch, the listener must consciously process the information provided by the speech signal, storing and manipulating it in working memory until it can be matched with the stored phonological representations. This is more difficult for listeners with a low working memory capacity and will likely result in their perceiving speech less accurately.
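The match/mismatch idea at the core of the ELU model can be sketched in a few lines of code. This is only an illustrative toy, assuming made-up phonological segments, a simple overlap measure, and an arbitrary matching threshold; none of these specifics come from Rönnberg et al. (2008).

```python
def segment_overlap(candidate, stored):
    """Fraction of positions at which two segment sequences agree (toy similarity)."""
    if not stored:
        return 0.0
    matches = sum(1 for a, b in zip(candidate, stored) if a == b)
    return matches / max(len(candidate), len(stored))

def match_rambpho(rambpho, lexicon, threshold=0.8):
    """Match an incoming phonological stream against stored representations.

    Returns (best_word, mode): 'implicit' when the match is good enough for
    rapid, automatic understanding, 'explicit' when degraded input forces
    slower, effortful processing in working memory.
    """
    best_word, best_score = None, 0.0
    for word, stored in lexicon.items():
        score = segment_overlap(rambpho, stored)
        if score > best_score:
            best_word, best_score = word, score
    mode = "implicit" if best_score >= threshold else "explicit"
    return best_word, mode

lexicon = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]}
print(match_rambpho(["k", "ae", "t"], lexicon))   # clear signal -> ('cat', 'implicit')
print(match_rambpho(["k", "ae", None], lexicon))  # noisy signal -> ('cat', 'explicit')
```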

Older Adults

Older adults often find speech perception more challenging than younger individuals, especially in sub-optimal listening conditions (e.g., in the presence of background noise). Declines in sensory abilities (e.g., hearing) and cognitive abilities (e.g., working memory capacity) that accompany aging explain at least in part why this is the case. Most speech information is presented in the auditory modality, which means that declines in hearing ability will likely lead to portions of the speech signal not being perceived accurately, or not being perceived at all. To compensate for this decline in hearing ability, older adults seem to rely increasingly on visual speech cues to get information from the speech signal. This can be explained by the inverse effectiveness hypothesis, which posits that when the quality of information presented in one sensory modality is decreased, information provided in another sensory modality is relied upon to a greater extent, and that greater multisensory integration occurs in these circumstances. Despite potentially relying on visual speech cues to a greater extent, older adults benefit from these cues to a similar extent as young adults when engaging in speech perception. This means that the ability to combine information presented in multiple sensory modalities is similar in adults of different ages, despite different degrees of reliance on visual speech cues. The ability to combine auditory and visual speech information is also similar for older adults with age-normative hearing (who experience only moderate declines in hearing as a result of age) and those with age-related hearing loss (who experience more severe declines in hearing as a result of age).

With age, working memory capacity declines (see Working memory). This makes speech perception more difficult for older adults because their capacity to store and use the information they collect from the speech signal is reduced, and some information from the speech signal will be lost because it cannot be stored and manipulated in working memory. This is especially true if speech is presented with background noise, because of the increased demands these listening conditions place on working memory.

Verbal abilities remain intact (and improve for some individuals) in old age. Older adults can therefore draw on these abilities when they engage in speech perception to compensate for declines in sensory and/or cognitive abilities. Because of these intact verbal abilities, older adults are able to use semantic context to perceive speech more accurately.