Articulatory synthesis

[[File:Modeling-Consonant-Vowel-Coarticulation-for-Articulatory-Speech-Synthesis-pone.0060603.s008.ogv|thumb|310px|3D vocal tract model for Articulatory synthesis

Based on Consonant-Vowel Coarticulation modeling, German sentence "Lea und Doreen mögen Bananen." was reproduced from a naturally spoken sentence in terms of the fundamental frequency and the phone durations. ]]

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The shape of the vocal tract can be controlled in a number of ways which usually involves modifying the position of the speech articulators, such as the tongue, jaw, and lips. Speech is created by digitally simulating the flow of air through the representation of the vocal tract.

Mechanical talking heads
There is a long history of attempts to build mechanical "talking heads". Gerbert (d. 1003), Albertus Magnus (1198–1280) and Roger Bacon (1214–1294) are all said to have built speaking heads (Wheatstone 1837). However, historically confirmed speech synthesis begins with Wolfgang von Kempelen (1734–1804), who published an account of his research in 1791 (see also ).

Electrical vocal tract analogs
The first electrical vocal tract analogs were static, like those of Dunn (1950), Ken Stevens and colleagues (1953), Gunnar Fant (1960). Rosen (1958) built a dynamic vocal tract (DAVO), which Dennis (1963) later attempted to control by computer. Dennis et al. (1964), Hiki et al. (1968) and Baxter and Strong (1969) have also described hardware vocal-tract analogs. Kelly and Lochbaum (1962) made the first computer simulation; later digital computer simulations have been made, e.g. by Nakata and Mitsuoka (1965), Matsui (1968) and Paul Mermelstein (1971). Honda et al. (1968) have made an analog computer simulation.

Haskins and Maeda models
The first software articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was a computational model of speech production based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues. Another popular model that has been frequently used is that of Shinji Maeda, which uses a factor-based approach to control tongue shape.

Modern models
Recent progress in speech production imaging, articulatory control modeling, and tongue biomechanics modeling has led to changes in the way articulatory synthesis is performed. Examples include the Haskins CASY model (Configurable Articulatory Synthesis), designed by Philip Rubin, Mark Tiede, and Louis Goldstein , which matches midsagittal vocal tracts to actual magnetic resonance imaging (MRI) data, and uses MRI data to construct a 3D model of the vocal tract. A full 3D articulatory synthesis model has been described by Olov Engwall. A geometrically based 3D articulatory speech synthesizer has been developed by Peter Birkholz (VocalTractLab ). The Directions Into Velocities of Articulators (DIVA) model, a feedforward control approach which takes the neural computations underlying speech production into consideration, was developed by Frank H. Guenther at Boston University. The ArtiSynth project, headed by Sidney Fels at the University of British Columbia, is a 3D biomechanical modeling toolkit for the human vocal tract and upper airway. Biomechanical modeling of articulators such as the tongue has been pioneered by a number of scientists, including Reiner Wilhelms-Tricarico, Yohan Payan and Jean-Michel Gerard , Jianwu Dang and Kiyoshi Honda.

Commercial models
One of the few commercial articulatory speech synthesis systems is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under a GNU General Public Licence, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Rene Carré's "distinctive region model".