Talk:Speech synthesis

Microsoft Sam glitch
Does anyone know why, when you have MS Sam read "soi" or "soy", it makes a really odd airy sound? There are other errors but I can't remember them. --Preceding unsigned comment added by 24.156.28.128 (talk) 11:30, 13 September 2008 (UTC)

Older comments
I think perhaps the different synthesis techniques are long enough to warrant their own pages Nohat 05:20 19 Jun 2003 (UTC)

Could somebody expand on 'Formant synthesis', both on the technical side and the terminology? Is any system using filtering techniques on basic waves + noise considered 'formant' synthesis? Is it a specific technique, or just the general term for synthesising phenomena?

Trillium Sound Research Inc (now defunct) offered unlimited vocabulary articulatory speech synthesis on the NeXT computer in 1994, so it is not accurate to say that articulatory speech synthesis is only of academic interest and not far enough advanced for commercial application. It was NeXT Computer that failed, not the synthesis, which was rated the best synthesis available at the time. That software is now the basis of the GnuSpeech project -- a port of the original NeXT software to Linux. It is under a GPL. The basis is an acoustic tube model, so it is low level articulatory synthesis with the necessary databases for varying the tube cross-sections, using the Fant/Carre research on formant sensitivity analysis and control regions. Provision is made for adding the higher level parameters such as tongue height, jaw opening, etc, but this extension is still undeveloped, and would rely on deriving relationships between these higher-level parameters and the low-level tube cross-section parameters. Other ports are possible/likely.

use in Weatheradio
Not sure where to put it, but the National Weather Service in the U.S. uses it on all Weatheradio stations now. The new voice sounds excellent, and I think it uses a hybrid of patched voice and true synthesis. The Weather Channel may also use this for their Vocal Local announcements during the local forecast (but not on their Weatherscan channel). –radiojon 02:47, 2004 Jun 4 (UTC)


 * The NWS "Tom" and "Donna" AKA "Mara" voices are the SpeechWorks Speechify (now merged with Realspeak) American English voices "Tom" and "Mara" (no longer available), which use a purely concatenative system. Nohat 06:57, 14 Apr 2005 (UTC)

Request for references
Hi, I am working to encourage implementation of the goals of the Verifiability policy. Part of that is to make sure articles cite their sources. This is particularly important for featured articles, since they are a prominent part of Wikipedia. The Fact and Reference Check Project has more information. If some of the external links are reliable sources and were used as references, they can be placed in a References section too. See the cite sources link for how to format them. Thank you, and please leave me a message when a few references have been added to the article. - Taxman 19:43, Apr 22, 2005 (UTC)

Early Voices Described as "Robotic" Seems Circular
Primitive speech synthesis devices sound robotic. A robotic voice is produced by a primitive speech synthesis device. This is circular. The popular idea of what a robot's voice sounds like comes from early attempts at speech synthesis. Film and television makers must have imitated what had been produced by early efforts at synthesis when creating robotic characters. It would be more accurate, I think, to say that the idea of a robotic voice came from efforts to produce speech synthesis. Saying that early speech synthesizers were robotic gets it backwards.


 * I'm not sure it's quite so simple as that. Interestingly, there has only ever been one speech synthesis system that spoke in a monotone (and not a very popular or often-used one at that), yet the most common feature of "robotic" voices when imitated by humans is a monotone. Clearly this notion of what a robot sounds like was not based on listening to actual synthesized speech. It is more likely that the idea of "robotic" voices came from what people imagined a synthetic voice would sound like, rather than what actual synthetic voices sounded like.
 * Regardless of all this, to the contemporary reader, the idea of the voice sounding "robotic" is probably a fairly safe (if, in the literal sense, perhaps preposterous) base point for explaining what old speech synthesis systems sounded like. Nohat 06:50, 25 October 2005 (UTC)

Open source software
Are there any open source speech synthesis projects? It would be great to summarize how the best few are doing or note the lack if there are none. -- Hippietrail 17:36, 15 April 2006 (UTC)


 * I used a freeware character-mode app for OS/2 that did text-to-speech back in 1997. Jimj wpg (talk) 05:13, 22 July 2020 (UTC)

Possible copyvio
A possible copyvio concern has arisen in the Feature Article review. User:Marskell wrote "I believe the Concatenative Synthesis section may be a text dump from here". This is a serious concern that should be addressed immediately. Joelito (talk) 19:30, 7 November 2006 (UTC)

External links cleanup
The external links section was getting filled with lots of links to similar websites. WP is not a directory of links (WP:NOT):
 * "Wikipedia articles are not mere collections of external links or internet directories. There is nothing wrong with adding one or more useful content-relevant links to an article; however, excessive lists can dwarf articles and detract from the purpose of Wikipedia"

I went ahead and removed most of the external links and added DMOZ category for speech synthesis (per WP recommendation). If you feel that any of the deleted links contribute substantially more than the others, please feel free to leave a comment here and we all can discuss. Thanks! Calltech 18:43, 20 December 2006 (UTC)

Fair use rationale for Image:MS Sam.ogg
Image:MS Sam.ogg is being used on this article. I notice the image page specifies that the image is being used under fair use but there is no explanation or rationale as to why its use in this Wikipedia article constitutes fair use. In addition to the boilerplate fair use template, you must also write out on the image description page a specific explanation or rationale for why using this image in each article is consistent with fair use.

Please go to the image description page and edit it to include a fair use rationale. Using one of the templates at Fair use rationale guideline is an easy way to ensure that your image is in compliance with Wikipedia policy, but remember that you must complete the template. Do not simply insert a blank template on an image page.

If there is other fair use media, consider checking that you have specified the fair use rationale on the other images used on this page. Note that any fair use images lacking such an explanation can be deleted one week after being tagged, as described on criteria for speedy deletion. If you have any questions please ask them at the Media copyright questions page. Thank you.

BetacommandBot (talk) 13:25, 8 March 2008 (UTC)

How on earth could this be copyrighted? It's a voice saying a sentence. You can't copyright arbitrary audio from a text-to-speech synthesizer. You can only copyright a specific recording. 99.14.103.236 (talk) 03:42, 20 September 2009 (UTC)

Text to speech based on Festival in Unix
www.wordtosound.com is installed on a Unix box. Type any text (English only) and get the output as a downloadable WAV or MP3 file. The voice has a British accent and is kind of croaky, but understandable. It is clearer in the WAV format. --Preceding unsigned comment added by 69.85.110.110 (talk) 21:04, 27 May 2008 (UTC)

Suggest that Heterogeneous Relation Graph (HRG) and Delta should be described here
These comprise an important phase of most modern TTS systems and should be discussed here. I don't currently have the time to add this section, but if no one else gets around to it, I'll come back and write up a few things when I'm less busy. Twikir (talk) 04:08, 15 April 2009 (UTC)

Craptalker
There's a program online that might count as a speech synthesizer. It's called "CrapTalker". Should it be added to the links?

http://www.computerpranks.com/software/default.cfm?ItemID=1690

Sohzq (talk) 13:47, 26 May 2009 (UTC)

Text-to-speech voices
Can a new section and article be made comparing the text-to-speech voices? Besides the conventional Microsoft Sam and Microsoft Anna, might some other voices exist?

Also, does a voice like the Monster, Alien, or Amplifier Halloween voice changer exist? These voices were featured in Fun with Dick & Jane (see here and here); they may be somewhat harder to understand, though, but could still be used in some applications. --Preceding unsigned comment added by 91.176.6.252 (talk) 11:55, 16 June 2009 (UTC)

Overview of text processing figure
Shouldn't the first block of the linguistic analysis component be "Phrasing" rather than "Phasing"? Broloks (talk) 16:55, 3 October 2009 (UTC)

A new alternative front end?
The current front end to this technology seems to be solely text processing. But a recently reported study here demonstrated a method of reconstructing words based on the brain waves of patients simply thinking of those words, by monitoring the superior temporal gyrus of their brains. The 2012 study by Pasley et al., reported in the journal PLoS Biology, used fMRI to track blood flow in the brains of 15 patients who were undergoing surgery for epilepsy or tumours, while playing audio of a number of different speakers reciting words and sentences. With the aid of a computer model, when patients were presented with words to think about, the team was able to guess which word the participants had chosen. Potential therapeutic implications have been suggested. Thanks. Martinevans123 (talk) 19:03, 14 February 2012 (UTC)

Robotics project attention needed
Chaosdruid (talk) 11:39, 24 March 2012 (UTC)
 * Refs - large amounts of text have no refs
 * Content - are all topics covered?
 * MoS compliance
 * Reassess

Needs more examples
Not to state the obvious, here, but this article needs more examples (i.e., sound files) of what kind of results the different types of speech synthesis can give. Right now the article has only two examples, neither tied to a specific section, and only one of which gives enough information to tell how it was generated. - dcljr (talk) 08:13, 14 January 2013 (UTC)

Kurzweil Reading Machine
According to a source already cited in the article, Klatt, D. (1987) "Review of Text-to-Speech Conversion for English" Journal of the Acoustical Society of America 82(3):737-93, the Kurzweil Reading Machine was the first commercial text-to-speech synthesis system. That is, on p. 770 Klatt has a diagram with a box for "Kurzweil Reading Machine, 1976", and under it it says "first commercial system". Do other sources disagree, or where is the best place to put this in the article? Silas Ropac (talk) 16:38, 8 February 2013 (UTC)

NeoSpeech & Natural Voices
After some quick research looking for the most natural speech synthesis voices, I found these two. I appreciate both, but NeoSpeech seems more natural. --TudorTulok (talk) 14:33, 11 March 2013 (UTC)
 * AT&T Natural Voices® Text-to-Speech Demo
 * NeoSpeech

CFD-01?
> Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there

Is this applied "computational fluid dynamics" by any other name? Are they taking a computed-tomography head scan and a full dental panoramic shot of a person, and turning those into a 3D model of the human vocal tract, which is then used as a virtual wind tunnel in a CFD software suite? If yes, the article could include a link. 82.131.151.104 (talk) 21:48, 21 August 2015 (UTC)


 * I think that computational fluid dynamics (CFD) is not suitable for articulatory synthesis, because CFD requires huge computational power to simulate the turbulence typically seen in fricatives and plosives. Instead, more simplified physical models called distributed element models (such as the waveguide synthesis once used for the Daisy Bell demo, or its variant, the Tube Resonance Model (TRM) used in Gnuspeech) seem to be practical. For details, see the article "Articulatory synthesis". --Clusternote (talk) 08:43, 27 August 2015 (UTC)
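
To make the tube-model idea in the reply above a little more concrete, here is a minimal Python sketch. It is not taken from Gnuspeech or any other system mentioned on this page; the function name `reflection_coefficients` and the example areas are assumptions chosen for illustration. It treats the vocal tract as a chain of short uniform tube sections and computes the pressure-wave reflection coefficient at each junction from the cross-sectional areas, which is the basic quantity a waveguide-style tube model works with. Sign conventions differ between texts; the one used here follows from the acoustic impedance of a section being inversely proportional to its area.

```python
def reflection_coefficients(areas):
    """Pressure-wave reflection coefficient at each junction of a tube chain.

    `areas` lists the cross-sectional areas of adjacent uniform tube
    sections, ordered from one end of the tract to the other. For a wave
    travelling from a section of area a1 into one of area a2, the pressure
    reflection coefficient is (a1 - a2) / (a1 + a2).
    """
    if len(areas) < 2:
        raise ValueError("need at least two tube sections")
    if any(a <= 0 for a in areas):
        raise ValueError("cross-sectional areas must be positive")
    # One coefficient per junction between neighbouring sections.
    return [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]


if __name__ == "__main__":
    # Hypothetical areas (arbitrary units): no change, then a widening.
    print(reflection_coefficients([1.0, 1.0, 3.0]))
```

Equal neighbouring areas give a coefficient of zero (no reflection), while a widening gives a negative value. This is far cheaper than a full CFD simulation, which is the practical point made above.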

Ride of the Vocaloids.
> This article says: Speech Synthesis is the artificial production of human speech

Is song synthesis a sub-field of speech synthesis or considered something completely different? If they are considered different, where is the boundary between them? (Cue Richard Wagner's Sprechgesang or the priestly intonations made during the Catholic / Orthodox Christian holy liturgy.) The current article doesn't discuss this dichotomy properly. 82.131.151.104 (talk) 22:16, 21 August 2015 (UTC)


 * This is an interdisciplinary field between sound synthesis (focused on harmonic structure) and speech synthesis (focused on the language model and acoustic model). In my view, a clear definition of the boundaries between them is hard to find, for several reasons:
 * Researchers and developers of singing synthesis are very few, and they do not seem to have any conflicts of interest with the neighboring fields, so neither they nor their neighbors may feel the need to clarify the boundaries to protect their interests.
 * For researchers and developers, it is probably obvious that singing synthesis is a customized version of speech synthesis optimized for song (for example, the clarity of the fundamental frequency for playing melody, the harmonicity of the spectral content for playing harmony, and possibly the smoothness of and delimitation between pronunciations, as seen in Vocaloid).
 * From the viewpoint of signal processing, speech synthesis and sound synthesis are similar in that both deal with audio signals. However, from the viewpoint of cognitive science (the science of how the human brain recognizes various media), the recognition of speech and of music takes different routes in the brain (see Language processing in the brain, Cognitive neuroscience of music). The recognition of song, a combination of the two, is probably described by the coexistence of these two routes plus additional interference between them. This interference, sometimes called synergy, may be a main difficulty in defining the boundaries.
 * In my opinion, singing synthesis should be described in a new dedicated article, to avoid inappropriately narrowing its potential through a thoughtless definition of boundaries. --Clusternote (talk) 02:45, 27 August 2015 (UTC)
 * P.S. I found a song performed on the Voder (an early speech synthesizer) circa 1939, which can be heard in a video. Its melody seems to be played with a pitch controller on the Voder. The inventors of the vocoder and Voder considered musical applications as a future plan, and Werner Meyer-Eppler in Germany wrote a paper in 1948 on his own perspectives. The separation of the elemental technologies (speech synthesis, vocoder as musical application, and singing synthesis) may have occurred after that. --Clusternote (talk) 04:52, 29 August 2015 (UTC)

1234567890
Everybody, User:Wtshymanski, User:1989, please stop removing the "1,234,567,890 times (unintelligible noise)". These are literally heard in the clip. Largoplazo (talk) 16:30, 13 February 2017 (UTC)

DeepMind's WaveNet
I stumbled across a 'recent' paper published by Google DeepMind that supposedly introduced a new way of voice synthesis. I'm not too well-read in this area, but it appears to use a method that isn't mentioned in this Wikipedia article, and improves significantly on the 'naturalness' side. Toyuyn (talk) 10:28, 14 April 2017 (UTC)

Early history
In fiction, robots often speak in a monotonous voice stripped of all prosodic elements. This was especially common when I was a child, let's say around 1990. Of course even then, speech synthesis had progressed much further than that. But have there been historical speech robots, like the computers at Bell Labs, that talked like that? Or is the monotonous robot voice just a common trope in fiction? Steinbach (talk) 21:06, 9 March 2018 (UTC)

Speak & Spell
I am missing the reference to an early toy from TI, named "Speak & Spell", that used speech synthesis in hardware. TI developed an integrated circuit to do the trick. In the 70s, microprocessors did not have the necessary power (and memory) to simply play pre-recorded PCM sounds as speech. Speech was synthesized with conventional, analog synthesis modules like in the musical instruments (cf. synthesizer). This could be done with much less computing power and memory.

And I think that Stephen Hawking's voice sounded just like Speak & Spell, so maybe it used this TI integrated circuit. 134.247.251.245 (talk) 14:21, 16 March 2018 (UTC)
 * It's mentioned in passing in the section History/Electronic Devices. The book "Hawking Incorporated" by Hélène Mialet says Hawking's voice synthesizer was made by a company called Speech Plus in 1986. --Wtshymanski (talk) 00:02, 17 March 2018 (UTC)

Digital recordings are not 'synthesis'
The article would be improved by making it clearer at the outset what 'synthesis' is, as opposed to playing back digital recordings. E.g. see this excellent article: Speech synthesizers. Early electronic speech (e.g. the Speak and Spell, as well as several 'speaking' cars) 'spoke' by 'playing back' complete words from a list of recorded words. That is speech 'reproduction', not production. True synthesis was a much harder problem to solve, especially if a natural, not 'wooden', sound is desired. The synthesis of spoken words from a page of text is more difficult still. Twang (talk) 02:20, 22 March 2019 (UTC)

Apple VoiceOver is not a speech synthesiser
In the Apple section, it says: "The Apple iOS operating system used on the iPhone, iPad and iPod Touch uses VoiceOver speech synthesis for accessibility." This is incorrect as VoiceOver is a screen reader, not a speech synthesiser. One of the synths included with iOS is actually Nuance's Vocalizer.

I've not been able to find many sources to back this up, though. The best I can find is from this article from Tech.pinions.

Would this be good enough?

KaraLG84 (talk) 00:02, 20 August 2021 (UTC)

Edit notice
Can we get a consensus to put an edit notice here on this talk page to hopefully help quell the WP:NOTFORUM and vandalism problem happening here quite a bit? There was a similar edit notice implemented at Talk:SCP Foundation (which also gets a lot of NOTFORUM comments) recently and it can be seen at Template:Editnotices/Page/Talk:SCP Foundation. wizzito | say hello! 19:59, 14 March 2022 (UTC)
 * The source of these comments appears to be some kind of Indonesian meme regarding a feature on WhatsApp that seems to involve text-to-speech ringtones. (ex. https://widyawicara.com/article/read/ubah-nada-dering-whatsapp-dengan-text-to-speech) The Indonesian version of this article has been affected by the same vandalism. wizzito | say hello! 00:01, 17 March 2022 (UTC)

Wiki Education assignment: Research Process and Methodology - FA22 - Sect 200 - Thu
— Assignment last updated by Jessssy (talk) 02:18, 17 November 2022 (UTC)

XSAMPA and the like
For the section "Text-to-phoneme challenges," I believe adding information about X-SAMPA, ARPAsing, or other ways of editing/using phonemes may help this portion of the article out, and would provide some sources for it. I don't have enough knowledge on the subject to efficiently and effectively research and edit it myself, but I thought I'd toss my two cents in. ↻ dialupnetwork Connect? 17:48, 28 November 2023 (UTC)
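
As a rough illustration of what an ASCII phoneme notation like X-SAMPA looks like in practice, here is a minimal Python sketch that converts a few X-SAMPA symbols to IPA. The table is deliberately a small subset of the full X-SAMPA chart, and the names `XSAMPA_TO_IPA` and `xsampa_to_ipa` are hypothetical, not part of any existing library. Characters that are not in the table (most lowercase letters, which X-SAMPA shares with IPA) are passed through unchanged.

```python
# Partial X-SAMPA -> IPA table (a small illustrative subset only).
XSAMPA_TO_IPA = {
    "@": "ə", "{": "æ", "A": "ɑ", "E": "ɛ", "I": "ɪ",
    "O": "ɔ", "U": "ʊ", "V": "ʌ",
    "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð", "N": "ŋ",
    "r\\": "ɹ",          # two-character symbol: 'r' followed by a backslash
    ":": "ː",            # length mark
    '"': "ˈ",            # primary stress
    "%": "ˌ",            # secondary stress
}

# Try longer symbols first so that 'r\' is not split into 'r' + '\'.
_SYMBOLS = sorted(XSAMPA_TO_IPA, key=len, reverse=True)


def xsampa_to_ipa(text):
    """Convert an X-SAMPA string to IPA using greedy longest-match lookup."""
    out = []
    i = 0
    while i < len(text):
        for symbol in _SYMBOLS:
            if text.startswith(symbol, i):
                out.append(XSAMPA_TO_IPA[symbol])
                i += len(symbol)
                break
        else:
            # Not in the partial table: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa('"spi:tS'))   # X-SAMPA for the word "speech"
```

A fuller treatment in the article would need the complete symbol chart and sources; this only shows the general shape of a notation that lets phonemes be typed and edited as plain text.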