Origin of speech

The origin of speech differs from the origin of language because language is not necessarily spoken; it could equally be written or signed. Speech is a fundamental aspect of human communication and plays a vital role in the everyday lives of humans. It allows them to convey thoughts, emotions, and ideas, and provides the ability to connect with others and shape collective reality.

Many attempts have been made to explain scientifically how speech emerged in humans, although to date no theory has generated agreement.

Non-human primates, like many other animals, have evolved specialized mechanisms for producing sounds for purposes of social communication. On the other hand, no monkey or ape uses its tongue for such purposes. The human species' unprecedented use of the tongue, lips and other moveable parts seems to place speech in a quite separate category, making its evolutionary emergence an intriguing theoretical challenge in the eyes of many scholars.

Modality-independence
The term modality means the chosen representational format for encoding and transmitting information. A striking feature of language is that it is modality-independent. Should an impaired child be prevented from hearing or producing sound, its innate capacity to master a language may equally find expression in signing. Sign languages of the deaf are independently invented and have all the major properties of spoken language except for the modality of transmission. From this it appears that the language centres of the human brain must have evolved to function optimally, irrespective of the selected modality.



Animal communication systems routinely combine visible with audible properties and effects, but none is modality-independent. For example, no vocally-impaired whale, dolphin, or songbird could express its song repertoire equally in visual display. Indeed, in the case of animal communication, message and modality are not capable of being disentangled. Whatever message is being conveyed stems from the intrinsic properties of the signal.

Modality independence should not be confused with the ordinary phenomenon of multimodality. Monkeys and apes rely on a repertoire of species-specific "gesture-calls" – emotionally-expressive vocalisations inseparable from the visual displays which accompany them. Humans also have species-specific gesture-calls – laughs, cries, sobs, etc. – together with involuntary gestures accompanying speech. Many animal displays are polymodal in that each appears designed to exploit multiple channels simultaneously.

The human linguistic property of modality independence is conceptually distinct from polymodality. It allows the speaker to encode the informational content of a message in a single channel whilst switching between channels as necessary. Modern city-dwellers switch effortlessly between the spoken word and writing in its various forms – handwriting, typing, email, etc. Whichever modality is chosen, it can reliably transmit the full message content without external assistance of any kind. When talking on the telephone, for example, any accompanying facial or manual gestures, however natural to the speaker, are not strictly necessary. When typing or manually signing, conversely, there is no need to add sounds. In many Australian Aboriginal cultures, a section of the population – perhaps women observing a ritual taboo – traditionally restrict themselves for extended periods to a silent (manually-signed) version of their language. Then, when released from the taboo, these same individuals resume narrating stories by the fireside or in the dark, switching to pure sound without sacrifice of informational content.

Evolution of the speech organs
Speaking is the default modality for language in all cultures. Humans' first recourse is to encode their thoughts in sound – a method which depends on sophisticated capacities for controlling the lips, tongue and other components of the vocal apparatus.

The speech organs evolved in the first instance not for speech but for more basic bodily functions such as feeding and breathing. Nonhuman primates have broadly similar organs, but with different neural controls. Nonhuman apes use their highly flexible, maneuverable tongues for eating but not for vocalising. When an ape is not eating, fine motor control over its tongue is deactivated. Either it is performing gymnastics with its tongue or it is vocalising; it cannot perform both activities simultaneously. Since this applies to mammals in general, Homo sapiens is exceptional in harnessing mechanisms designed for respiration and ingestion for the radically different requirements of articulate speech.

Possible semi-aquatic adaptations
Recent insights into human evolution – more specifically, human Pleistocene littoral evolution – may help explain how human speech evolved. One controversial suggestion is that certain pre-adaptations for spoken language evolved during a time when ancestral hominins lived close to river banks and lake shores whose foods were rich in fatty acids and other brain-specific nutrients. Occasional wading or swimming may also have led to enhanced breath-control (breath-hold diving).

Independent lines of evidence suggest that "archaic" Homo spread intercontinentally along the Indian Ocean shores (they even reached overseas islands such as Flores) where they regularly dived for littoral foods such as shellfish and crayfish, which are extremely rich in brain-specific nutrients and which, on this view, help explain Homo's brain enlargement. Shallow diving for seafoods requires voluntary airway control, a prerequisite for spoken language. Seafood such as shellfish generally requires stone tool use and suction feeding rather than biting and chewing. This finer control of the oral apparatus was arguably another biological pre-adaptation to human speech, especially for the production of consonants.

Tongue
The word "language" derives from the Latin lingua, "tongue". Phoneticians agree that the tongue is the most important speech articulator, followed by the lips. A natural language can be viewed as a particular way of using the tongue to express thought.

The human tongue has an unusual shape. In most mammals, it is a long, flat structure contained largely within the mouth. It is attached at the rear to the hyoid bone, situated below the oral level in the pharynx. In humans, the tongue has an almost circular sagittal (midline) contour, much of it lying vertically down an extended pharynx, where it is attached to a hyoid bone in a lowered position. Partly as a result of this, the horizontal (inside-the-mouth) and vertical (down-the-throat) tubes forming the supralaryngeal vocal tract (SVT) are almost equal in length (whereas in other species, the vertical section is shorter). As humans move their jaws up and down, the tongue can vary the cross-sectional area of each tube independently by about 10:1, altering formant frequencies accordingly. That the tubes are joined at a right angle permits pronunciation of the vowels [i], [u] and [a], which nonhuman primates cannot produce. Even when not performed particularly accurately, in humans the articulatory gymnastics needed to distinguish these vowels yield consistent, distinctive acoustic results, illustrating the quantal nature of human speech sounds. It may not be coincidental that [i], [u] and [a] are the most common vowels in the world's languages. Human tongues are considerably shorter and thinner than those of other mammals and are composed of a large number of muscles, which helps shape a variety of sounds within the oral cavity. The diversity of sound production is further increased by the human ability to open and close the airway, allowing varying amounts of air to exit through the nose. The fine motor movements associated with the tongue and the airway make humans more capable of producing a wide range of intricate shapes in order to produce sounds at different rates and intensities.

Lips
In humans, the lips are important for the production of stops and fricatives, in addition to vowels. Nothing, however, suggests that the lips evolved for those reasons. During primate evolution, a shift from nocturnal to diurnal activity in tarsiers, monkeys and apes (the haplorhines) brought with it an increased reliance on vision at the expense of olfaction. As a result, the snout became reduced and the rhinarium or "wet nose" was lost. The muscles of the face and lips consequently became less constrained, enabling their co-option to serve purposes of facial expression. The lips also became thicker, and the oral cavity hidden behind became smaller. Hence, according to Ann MacLarnon, "the evolution of mobile, muscular lips, so important to human speech, was the exaptive result of the evolution of diurnality and visual communication in the common ancestor of haplorhines". It is unclear whether human lips have undergone a more recent adaptation to the specific requirements of speech.

Respiratory control
Compared with nonhuman primates, humans have significantly enhanced control of breathing, enabling exhalations to be extended and inhalations shortened as we speak. Whilst we are speaking, intercostal and interior abdominal muscles are recruited to expand the thorax and draw air into the lungs, and subsequently to control the release of air as the lungs deflate. The muscles concerned are markedly more innervated in humans than in nonhuman primates. Evidence from fossil hominins suggests that the necessary enlargement of the vertebral canal, and therefore spinal cord dimensions, may not have occurred in Australopithecus or Homo erectus but was present in the Neanderthals and early modern humans.

Larynx


The larynx or voice box is an organ in the neck housing the vocal folds, which are responsible for phonation. In humans, the larynx is descended: it is positioned lower than in other primates. This is because the evolution of humans to an upright position shifted the head directly above the spinal cord, forcing everything else downward. The repositioning of the larynx resulted in a longer cavity called the pharynx, which is responsible for increasing the range and clarity of the sound being produced. Other primates have almost no pharynx; therefore, their vocal power is significantly lower. Humans are not unique in having a lowered larynx: goats, dogs, pigs and tamarins lower the larynx temporarily, to emit loud calls. Several deer species have a permanently lowered larynx, which may be lowered still further by males during their roaring displays. Lions, jaguars, cheetahs and domestic cats also do this. However, laryngeal descent in nonhumans (according to Philip Lieberman) is not accompanied by descent of the hyoid; hence the tongue remains horizontal in the oral cavity, preventing it from acting as a pharyngeal articulator.

Despite all this, scholars remain divided as to how "special" the human vocal tract really is. It has been shown that the larynx does descend to some extent during development in chimpanzees, followed by hyoidal descent. As against this, Philip Lieberman points out that only humans have evolved permanent and substantial laryngeal descent in association with hyoidal descent, resulting in a curved tongue and two-tube vocal tract with 1:1 proportions. Uniquely in the human case, simple contact between the epiglottis and velum is no longer possible, disrupting the normal mammalian separation of the respiratory and digestive tracts during swallowing. Since this entails substantial costs – increasing the risk of choking whilst swallowing food – we are forced to ask what benefits might have outweighed those costs. Some claim the clear benefit must have been speech, but others contest this. One objection is that humans are in fact not seriously at risk of choking on food: medical statistics indicate that accidents of this kind are extremely rare. Another objection is that in the view of most scholars, speech as we know it emerged relatively late in human evolution, roughly contemporaneously with the emergence of Homo sapiens. A development as complex as the reconfiguration of the human vocal tract would have required much more time, implying an early date of origin. This discrepancy in timescales undermines the idea that human vocal flexibility was initially driven by selection pressures for speech.

At least one orangutan has demonstrated the ability to control the voice box.

The size exaggeration hypothesis
To lower the larynx is to increase the length of the vocal tract, in turn lowering formant frequencies so that the voice sounds "deeper" – giving an impression of greater size. John Ohala argued that the function of the lowered larynx in humans, especially males, is probably to enhance threat displays rather than speech itself. Ohala pointed out that if the lowered larynx were an adaptation for speech, we would expect adult human males to be better adapted in this respect than adult females, whose larynx is considerably less low. In fact, females invariably outperform males in verbal tests, falsifying the idea that the lowered larynx is an adaptation for speech. William Tecumseh Fitch likewise argues that size exaggeration was the original selective advantage of laryngeal lowering in humans. Although, according to Fitch, the initial lowering of the larynx in humans had nothing to do with speech, the increased range of possible formant patterns was subsequently co-opted for speech. Size exaggeration remains the sole function of the extreme laryngeal descent observed in male deer. Consistent with the size exaggeration hypothesis, a second descent of the larynx occurs at puberty in humans, although only in males. In response to the objection that the larynx is descended in human females, Fitch suggests that mothers vocalising to protect their infants would also have benefited from this ability.
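The link between vocal tract length and formant frequency can be illustrated with a rough sketch. It assumes the textbook idealisation of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / (4L). The function name, the example tract lengths and the speed of sound (350 m/s) are illustrative assumptions, not values taken from the studies discussed above.

```python
def tube_formants(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Idealised quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    Real vocal tracts are not uniform tubes, so this is only a first approximation.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


# Hypothetical tract lengths in metres, chosen only to show the direction of the effect.
short_tract = tube_formants(0.15)
long_tract = tube_formants(0.17)

for short_f, long_f in zip(short_tract, long_tract):
    # Every formant of the longer tube is lower than its counterpart in the shorter tube.
    print(f"{short_f:7.1f} Hz -> {long_f:7.1f} Hz")
```

Under this idealisation, any lengthening of the tract lowers every formant by the same proportion, which is why a lowered larynx yields a voice that sounds as if it comes from a larger body.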

Neanderthal speech


Most specialists credit the Neanderthals with speech abilities not radically different from those of modern Homo sapiens. An indirect line of argument is that their toolmaking and hunting tactics would have been difficult to learn or execute without some kind of speech. A recent extraction of DNA from Neanderthal bones indicates that Neanderthals had the same version of the FOXP2 gene as modern humans. This gene, mistakenly described as the "grammar gene", plays a role in controlling the orofacial movements which (in modern humans) are involved in speech.

During the 1970s, it was widely believed that the Neanderthals lacked modern speech capacities. It was claimed that they possessed a hyoid bone so high up in the vocal tract as to preclude the possibility of producing certain vowel sounds.

The hyoid bone is present in many mammals. It allows a wide range of tongue, pharyngeal and laryngeal movements by bracing these structures alongside each other in order to produce variation. It is now realised that its lowered position is not unique to Homo sapiens, whilst its relevance to vocal flexibility may have been overstated: although men have a lower larynx, they do not produce a wider range of sounds than women or two-year-old babies. There is no evidence that the larynx position of the Neanderthals impeded the range of vowel sounds they could produce. The discovery of a modern-looking hyoid bone of a Neanderthal man in the Kebara Cave in Israel led its discoverers to argue that the Neanderthals had a descended larynx, and thus human-like speech capabilities. However, other researchers have claimed that the morphology of the hyoid is not indicative of the larynx's position. It is necessary to take into consideration the skull base, the mandible, the cervical vertebrae and a cranial reference plane.

The morphology of the outer and middle ear of Middle Pleistocene hominins from Atapuerca, Spain, believed to be proto-Neanderthal, suggests they had an auditory sensitivity similar to modern humans and very different from chimpanzees. They were probably able to differentiate between many different speech sounds.

Hypoglossal canal
The hypoglossal nerve plays an important role in controlling movements of the tongue. In 1998, a research team used the size of the hypoglossal canal in the base of fossil skulls in an attempt to estimate the relative number of nerve fibres, claiming on this basis that Middle Pleistocene hominins and Neanderthals had more fine-tuned tongue control than either Australopithecines or apes. Subsequently, however, it was demonstrated that hypoglossal canal size and nerve sizes are not correlated, and it is now accepted that such evidence is uninformative about the timing of human speech evolution.

Distinctive features theory
According to one influential school, the human vocal apparatus is intrinsically digital on the model of a keyboard or digital computer (see below). Nothing about a chimpanzee's vocal apparatus suggests a digital keyboard, notwithstanding the anatomical and physiological similarities. This poses the question as to when and how, during the course of human evolution, the transition from analog to digital structure and function occurred.

The human supralaryngeal tract is said to be digital in the sense that it is an arrangement of moveable toggles or switches, each of which, at any one time, must be in one state or another. The vocal cords, for example, are either vibrating (producing a sound) or not vibrating (in silent mode). By virtue of simple physics, the corresponding distinctive feature – in this case, "voicing" – cannot be somewhere in between. The options are limited to "off" and "on". Equally digital is the feature known as "nasalisation". At any given moment the soft palate or velum either allows or does not allow sound to resonate in the nasal chamber. In the case of lip and tongue positions, more than two digital states may be allowed.

The theory that speech sounds are composite entities constituted by complexes of binary phonetic features was first advanced in 1938 by the Russian linguist Roman Jakobson. A prominent early supporter of this approach was Noam Chomsky, who went on to extend it from phonology to language more generally, in particular to the study of syntax and semantics. In his 1965 book, Aspects of the Theory of Syntax, Chomsky treated semantic concepts as combinations of binary-digital atomic elements explicitly on the model of distinctive features theory. The lexical item "bachelor", on this basis, would be expressed as [+ Human], [+ Male], [- Married].
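The idea of an item as a bundle of binary features can be sketched in a few lines. The example below is a minimal illustration, not a rendering of any particular published feature system: the function names are hypothetical, and the choice of three labial consonants described by only two toggles (voicing and nasalisation) is a simplifying assumption.

```python
def render(bundle):
    """Format a feature bundle in bracket notation, e.g. [+ Human]."""
    return ", ".join(f"[{'+' if on else '-'} {name}]" for name, on in bundle.items())


def contrast(a, b):
    """Names of the features on which two bundles disagree."""
    return [name for name in a if a[name] != b[name]]


# The lexical example from the text.
BACHELOR = {"Human": True, "Male": True, "Married": False}

# Illustrative assumption: three labial consonants described by two on/off toggles.
P = {"voicing": False, "nasalisation": False}   # [p]
B = {"voicing": True, "nasalisation": False}    # [b]
M = {"voicing": True, "nasalisation": True}     # [m]

print(render(BACHELOR))   # [+ Human], [+ Male], [- Married]
print(contrast(P, B))     # ['voicing']
print(contrast(B, M))     # ['nasalisation']
```

On this view, two sounds differ because at least one toggle is set differently, and an inventory is a set of permitted toggle combinations.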

Supporters of this approach view the vowels and consonants recognised by speakers of a particular language or dialect at a particular time as cultural entities of little scientific interest. From a natural science standpoint, the units which matter are those common to Homo sapiens by virtue of biological nature. By combining the atomic elements or "features" with which all humans are innately equipped, anyone may in principle generate the entire range of vowels and consonants to be found in any of the world's languages, whether past, present or future. The distinctive features are in this sense atomic components of a universal language.

Criticism
In recent years, the notion of an innate "universal grammar" underlying phonological variation has been called into question. The most comprehensive monograph ever written about speech sounds, The Sounds of the World's Languages, by Peter Ladefoged and Ian Maddieson, found virtually no basis for the postulation of some small number of fixed, discrete, universal phonetic features. Examining 305 languages, for example, they encountered vowels that were positioned almost everywhere along the articulatory and acoustic continuum. Ladefoged concluded that phonological features are not determined by human nature: "Phonological features are best regarded as artifacts that linguists have devised in order to describe linguistic systems".

Self-organisation theory


Self-organisation characterises systems where macroscopic structures are spontaneously formed out of local interactions between the many components of the system. In self-organised systems, global organisational properties are not to be found at the local level. In colloquial terms, self-organisation is roughly captured by the idea of "bottom-up" (as opposed to "top-down") organisation. Examples of self-organised systems range from ice crystals to galaxy spirals in the inorganic world.

According to many phoneticians, the sounds of language arrange and re-arrange themselves through self-organisation. Speech sounds have both perceptual (how one hears them) and articulatory (how one produces them) properties, all with continuous values. Speakers tend to minimise effort, favouring ease of articulation over clarity. Listeners do the opposite, favouring sounds that are easy to distinguish even if difficult to pronounce. Since speakers and listeners are constantly switching roles, the syllable systems actually found in the world's languages turn out to be a compromise between acoustic distinctiveness on the one hand, and articulatory ease on the other.

Agent-based computer models take the perspective of self-organisation at the level of the speech community or population. The two main paradigms are (1) the iterated learning model and (2) the language game model. Iterated learning focuses on transmission from generation to generation, typically with just one agent in each generation. In the language game model, a whole population of agents simultaneously produce, perceive and learn language, inventing novel forms when the need arises.

Several models have shown how relatively simple peer-to-peer vocal interactions, such as imitation, can spontaneously self-organise a system of sounds shared by the whole population, and different in different populations. For example, models elaborated by Berrah et al. (1996) and de Boer (2000), and recently reformulated using Bayesian theory, showed how a group of individuals playing imitation games can self-organise repertoires of vowel sounds which share substantial properties with human vowel systems. In de Boer's model, vowels are initially generated randomly, but agents learn from each other as they interact repeatedly over time. Agent A chooses a vowel from her repertoire and produces it, inevitably with some noise. Agent B hears this vowel and chooses the closest equivalent from her own repertoire. To check whether this truly matches the original, B produces the vowel she thinks she has heard, whereupon A refers once again to her own repertoire to find the closest equivalent. If this matches the one she initially selected, the game is successful; otherwise, it has failed. "Through repeated interactions", according to de Boer, "vowel systems emerge that are very much like the ones found in human languages".
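A single round of the imitation game described above can be sketched as follows. This is a heavily simplified illustration rather than a reproduction of de Boer's model: vowels are reduced to points in an abstract two-dimensional space, noise is Gaussian, and the update rule (nudging the listener's vowel towards what she heard after a failed round) is an assumption made for the sake of the example. All names and parameter values are hypothetical.

```python
import random


def closest(repertoire, signal):
    """Index of the repertoire vowel nearest to a perceived signal."""
    return min(
        range(len(repertoire)),
        key=lambda i: (repertoire[i][0] - signal[0]) ** 2
        + (repertoire[i][1] - signal[1]) ** 2,
    )


def produce(vowel, noise, rng):
    """Utter a vowel with Gaussian articulation noise on each dimension."""
    return (vowel[0] + rng.gauss(0.0, noise), vowel[1] + rng.gauss(0.0, noise))


def imitation_game(speaker, listener, noise, rng, step=0.1):
    """Play one round; return True if the speaker recognises her own vowel."""
    chosen = rng.randrange(len(speaker))
    heard = produce(speaker[chosen], noise, rng)      # A produces a vowel
    guess = closest(listener, heard)                  # B picks her closest equivalent
    echo = produce(listener[guess], noise, rng)       # B imitates what she heard
    success = closest(speaker, echo) == chosen        # A checks the imitation
    if not success:
        # Assumed update rule: move the listener's vowel towards the heard signal.
        gx, gy = listener[guess]
        listener[guess] = (gx + step * (heard[0] - gx), gy + step * (heard[1] - gy))
    return success


rng = random.Random(0)
# Five agents, each starting with three randomly placed vowels.
agents = [[(rng.random(), rng.random()) for _ in range(3)] for _ in range(5)]

results = []
for _ in range(2000):
    a, b = rng.sample(agents, 2)
    results.append(imitation_game(a, b, noise=0.02, rng=rng))

early = sum(results[:200]) / 200
late = sum(results[-200:]) / 200
print(f"success rate, first 200 games: {early:.2f}; last 200 games: {late:.2f}")
```

Whether the success rate rises over time depends on the update rule and parameters; published models use richer mechanisms, such as adding, merging and discarding vowels, that this sketch omits.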

In a different model, the phonetician Björn Lindblom was able to predict, on self-organisational grounds, the favoured choices of vowel systems ranging from three to nine vowels on the basis of a principle of optimal perceptual differentiation.
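The principle of optimal perceptual differentiation can be illustrated with a toy search. The sketch below assumes one common way of formalising dispersion, namely choosing the vowel set that minimises the sum of inverse squared distances between all pairs of vowels; it is not claimed to be the exact criterion used in the model mentioned above. The candidate points are invented coordinates in an abstract two-dimensional space, and the function names are hypothetical.

```python
from itertools import combinations


def crowding(vowels):
    """Sum of inverse squared distances over all vowel pairs (lower = better dispersed).

    Candidates must be distinct points, otherwise a distance of zero would occur.
    """
    return sum(
        1.0 / ((ax - bx) ** 2 + (ay - by) ** 2)
        for (ax, ay), (bx, by) in combinations(vowels, 2)
    )


def best_system(candidates, size):
    """Exhaustively pick the least crowded vowel system of the given size."""
    return min(combinations(candidates, size), key=crowding)


# Invented candidate positions: three corners of a triangle plus two interior/edge points.
CANDIDATES = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0), (0.5, 0.3), (0.5, 0.0)]

print(best_system(CANDIDATES, 3))   # ((0.0, 0.0), (1.0, 0.0), (0.5, 1.0))
```

With these toy coordinates the three-vowel optimum sits at the corners of the space, loosely analogous to the tendency of three-vowel systems to occupy the extremes of the vowel space. Exhaustive search is only feasible for small candidate sets.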

Further models studied the role of self-organisation in the origins of phonemic coding and combinatoriality, which is the existence of phonemes and their systematic reuse to build structured syllables. Pierre-Yves Oudeyer developed models which showed that basic neural equipment for adaptive holistic vocal imitation, coupling directly motor and perceptual representations in the brain, can generate spontaneously shared combinatorial systems of vocalisations, including phonotactic patterns, in a society of babbling individuals. These models also characterised how morphological and physiological innate constraints can interact with these self-organised mechanisms to account for both the formation of statistical regularities and diversity in vocalisation systems.

Gestural theory
The gestural theory states that speech was a relatively late development, evolving by degrees from a system that was originally gestural. Human ancestors were unable to control their vocalisation at the time when gestures were used to communicate; however, as they slowly began to control their vocalisations, spoken language began to evolve.

Three types of evidence support this theory:
 * 1) Gestural language and vocal language depend on similar neural systems. The regions of the cortex that are responsible for mouth and hand movements border each other.
 * 2) Nonhuman primates minimise vocal signals in favour of manual, facial and other visible gestures in order to express simple concepts and communicative intentions in the wild. Some of these gestures resemble those of humans, such as the "begging posture", with the hands stretched out, which humans share with chimpanzees.
 * 3) Mirror neurons, which fire both when an action is performed and when it is observed, have been found in primates.

Research has found strong support for the idea that spoken language and signing depend on similar neural structures. Patients who used sign language, and who suffered from a left-hemisphere lesion, showed the same disorders with their sign language as vocal patients did with their oral language. Other researchers found that the same left-hemisphere brain regions were active during sign language as during the use of vocal or written language.

Humans spontaneously use hand and facial gestures when formulating ideas to be conveyed in speech. There are also, of course, many sign languages in existence, commonly associated with deaf communities; as noted above, these are equal in complexity, sophistication, and expressive power, to any oral language. The main difference is that the "phonemes" are produced on the outside of the body, articulated with hands, body, and facial expression, rather than inside the body articulated with tongue, teeth, lips, and breathing.

Many psychologists and scientists have looked into the mirror system in the brain to test this theory as well as other behavioural theories. Evidence cited in support of mirror neurons as a factor in the evolution of speech includes mirror neurons in primates, the success of teaching apes to communicate gesturally, and pointing/gesturing to teach young children language. Fogassi and Ferrari (2014) monitored motor cortex activity in monkeys, specifically area F5, regarded as the monkey homologue of Broca's area, where mirror neurons are located. They observed changes in electrical activity in this area when the monkey executed hand actions or observed them being performed by someone else. Broca's area is a region in the frontal lobe responsible for language production and processing. The discovery of mirror neurons in this region, which fire when an action is done or observed specifically with the hand, is taken to support the belief that communication was once accomplished with gestures. The same is said to hold when teaching young children language: when one points at a specific object or location, mirror neurons in the child fire as though the child were doing the action, which results in long-term learning.

Criticism
Critics note that for mammals in general, sound turns out to be the best medium in which to encode information for transmission over distances at speed. Given the probability that this applied also to early humans, it is hard to see why they should have abandoned this efficient method in favour of more costly and cumbersome systems of visual gesturing – only to return to sound at a later stage.

By way of explanation, it has been proposed that at a relatively late stage in human evolution, hands became so much in demand for making and using tools that the competing demands of manual gesturing became a hindrance. The transition to spoken language is said to have occurred only at that point. Since humans throughout evolution have been making and using tools, however, most scholars remain unconvinced by this argument. (For a different approach to this issue – one setting out from considerations of signal reliability and trust – see "from pantomime to speech" below.)

Timeline of speech evolution
Little is known about the timing of language's emergence in the human species. Unlike writing, speech leaves no material trace, making it archaeologically invisible. Lacking direct linguistic evidence, specialists in human origins have resorted to the study of anatomical features and genes arguably associated with speech production. Whilst such studies may provide information as to whether pre-modern Homo species had speech capacities, it is still unknown whether they actually spoke. Whilst they may have communicated vocally, the anatomical and genetic data lack the resolution necessary to differentiate proto-language from speech.

Using statistical methods to estimate the time required to achieve the current spread and diversity of modern languages, Johanna Nichols – a linguist at the University of California, Berkeley – argued in 1998 that vocal languages must have begun diversifying at least 100,000 years ago.

In 2012, anthropologists Charles Perreault and Sarah Mathew used phonemic diversity to suggest a date consistent with this. "Phonemic diversity" denotes the number of perceptually distinct units of sound – consonants, vowels and tones – in a language. The current worldwide pattern of phonemic diversity potentially contains the statistical signal of the expansion of modern Homo sapiens out of Africa, beginning around 60–70 thousand years ago. Some scholars argue that phonemic diversity evolves slowly and can be used as a clock to calculate how long the oldest African languages would have to have been around in order to accumulate the number of phonemes they possess today. As human populations left Africa and expanded into the rest of the world, they underwent a series of bottlenecks – points at which only a very small population survived to colonise a new continent or region. Allegedly such a population crash led to a corresponding reduction in genetic, phenotypic and phonemic diversity. African languages today have some of the largest phonemic inventories in the world, whilst the smallest inventories are found in South America and Oceania, some of the last regions of the globe to be colonised. For example, Rotokas, a language of Papua New Guinea, and Pirahã, spoken in South America, both have just 11 phonemes, whilst !Xun, a language spoken in Southern Africa, has 141 phonemes. The authors use a natural experiment – the colonisation of mainland Southeast Asia on the one hand, the long-isolated Andaman Islands on the other – to estimate the rate at which phonemic diversity increases through time. Using this rate, they estimate that the world's languages date back to the Middle Stone Age in Africa, sometime between 350 thousand and 150 thousand years ago. This corresponds to the speciation event which gave rise to Homo sapiens.

These and similar studies have, however, been criticised by linguists who argue that they are based on a flawed analogy between genes and phonemes, since phonemes, unlike genes, are frequently transferred laterally between languages, and on a flawed sampling of the world's languages, since both Oceania and the Americas also contain languages with very high numbers of phonemes, and Africa contains languages with very few. They argue that the actual distribution of phonemic diversity in the world reflects recent language contact and not deep language history, since it is well demonstrated that languages can lose or gain many phonemes over very short periods. In other words, there is no valid linguistic reason to expect genetic founder effects to influence phonemic diversity.