In our last post, we discussed speech-to-text technology, whose background differs from the history and current applications of text-to-speech technology. Text-to-speech, achieved through the process of speech synthesis, has been in development for much longer than speech-to-text, and it is concerned more with providing technology that aids people than with inputting information. However, text-to-speech can be successfully paired with speech-to-text to provide a type of human-computer interaction that does not require a keyboard.
The earliest known evidence of speech synthesis dates back to the Eighteenth Century, when Christian Kratzenstein explained the physiological differences between five long vowels (/a/, /e/, /i/, /o/, and /u/) and, in St. Petersburg in 1779, built an apparatus to produce them artificially. He constructed acoustic resonators that bore similarities to the human vocal tract, and when activated with vibrating reeds, as in musical instruments, they produced sounds similar to the vowels they were designed to imitate. Not long after, in 1791 in Vienna, Wolfgang von Kempelen introduced his Acoustic-Mechanical Speech Machine, which could produce single sounds along with some sound combinations. In the mid-1800s, Charles Wheatstone built a more advanced and more famous version of Kempelen’s machine, able to produce vowels, most of the consonants, and even some words. Vowels in this machine were produced with a vibrating reed, while consonants were produced by turbulent airflow through a suitable passage with the reed turned off.
The technology to create electronic resonators existed in the early Twentieth Century, but it was not until the 1930s that the first electrical speech synthesizers were created. Homer Dudley at Bell Laboratories accomplished this, initially with the “Voder”, which was first presented at the 1939 and 1940 World’s Fairs. The Voder used an array of ten parallel resonators spanning the frequencies of the speech spectrum, operated through a keyboard and a pedal that controlled different aspects of the sound. The system was very complicated, and even after months of training, operators could produce intelligible speech only when listeners had the context of a question to guide them. Dudley’s other device, the channel vocoder, was released around the same time and has lasted as the basis for many speech synthesis devices still in use today. The machine was divided into two halves: one analyzed an incoming speech signal into a set of natural sound parameters, and the other used those parameters to produce a synthetic sound.
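As a rough illustration of that analysis/synthesis split, here is a toy channel vocoder sketched in Python. The sample rate, band edges, carrier, and function names below are all assumptions chosen for brevity, not a description of Dudley’s device; a real vocoder uses many more channels and a proper excitation model.

```python
# Toy channel-vocoder sketch (illustrative only): the analysis half measures
# the amplitude envelope of the input in a bank of frequency bands, and the
# synthesis half imposes those envelopes on a buzzy carrier signal.
import numpy as np
from scipy.signal import butter, lfilter

FS = 16000                                                   # sample rate (Hz), assumed
BANDS = [(200, 400), (400, 800), (800, 1600), (1600, 3200)]  # simplified band edges (Hz)

def bandpass(signal, low, high, fs=FS, order=4):
    # Band-pass filter one channel of the filterbank.
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return lfilter(b, a, signal)

def envelope(signal, fs=FS, cutoff=50.0):
    # Rectify, then low-pass, to get a slowly varying amplitude envelope.
    b, a = butter(2, cutoff / (fs / 2))
    return lfilter(b, a, np.abs(signal))

def vocode(speech, f0=120.0):
    t = np.arange(len(speech)) / FS
    # Square-wave carrier standing in for the buzzy voice source.
    carrier = np.sign(np.sin(2 * np.pi * f0 * t))
    out = np.zeros_like(speech)
    for low, high in BANDS:
        env = envelope(bandpass(speech, low, high))   # analysis half
        out += env * bandpass(carrier, low, high)     # synthesis half
    return out

# Example: vocode one second of noise standing in for recorded speech.
speech = np.random.randn(FS)
synthetic = vocode(speech)
```

The point of the sketch is the division of labor: measuring the band envelopes is the analysis half, and re-imposing them on a new source is the synthesis half.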
In recent history, speech synthesis has moved away from these large vocoder machines and become computer-based, taking several different forms, the most widely used of which is concatenative, data-driven sound synthesis. Methods of this kind draw on a large database of source sounds, segmented into heterogeneous units, together with a unit selection algorithm that finds the units that best match the sound to be synthesized, known as the target. This approach is largely responsible for modern synthetic speech sounding much more like a human speaker than the robotic-sounding voices of the past. It has mostly displaced a process known as diphone synthesis, in which recorded speech was split into phonemes and all diphones, or pairs of adjacent half-phones, were modeled. Because the “center” of a phonetic realization is its most stable region, every phoneme in a language can be recreated by modeling its diphones, although the approach works better for languages with a smaller inventory of diphones.
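To make the idea of unit selection more concrete, here is a minimal sketch in Python. Everything in it is a hypothetical simplification: the tiny unit database, the single pitch feature, and the target and join cost functions stand in for the much richer acoustic and linguistic features, and far larger inventories, that real concatenative systems use.

```python
# Minimal, illustrative unit-selection sketch (not a production TTS system).
from dataclasses import dataclass

@dataclass
class Unit:
    phone: str      # phone label of the recorded segment
    pitch: float    # simplified acoustic feature (e.g., mean F0 in Hz)

# Hypothetical database: a few recorded candidates per phone.
DATABASE = {
    "h":  [Unit("h", 110.0), Unit("h", 130.0)],
    "ə":  [Unit("ə", 115.0), Unit("ə", 140.0)],
    "l":  [Unit("l", 120.0), Unit("l", 135.0)],
    "oʊ": [Unit("oʊ", 118.0), Unit("oʊ", 150.0)],
}

def target_cost(unit, target_pitch):
    # How well a candidate matches the desired target specification.
    return abs(unit.pitch - target_pitch)

def join_cost(prev, cur):
    # How smoothly two adjacent units concatenate (pitch mismatch here).
    return abs(prev.pitch - cur.pitch)

def select_units(target_phones, target_pitch=120.0):
    # Pick one candidate per target phone, minimizing total cost
    # with a simple Viterbi-style dynamic program.
    best = []  # best[i][j] = (cumulative cost, back-pointer) for candidate j of phone i
    for i, phone in enumerate(target_phones):
        row = []
        for unit in DATABASE[phone]:
            tc = target_cost(unit, target_pitch)
            if i == 0:
                row.append((tc, None))
            else:
                prev_candidates = DATABASE[target_phones[i - 1]]
                cost, back = min(
                    (best[i - 1][k][0] + join_cost(prev_candidates[k], unit) + tc, k)
                    for k in range(len(prev_candidates))
                )
                row.append((cost, back))
        best.append(row)

    # Trace back the cheapest path to recover the chosen units.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chosen = []
    for i in range(len(target_phones) - 1, -1, -1):
        chosen.append(DATABASE[target_phones[i]][j])
        j = best[i][j][1]
    return list(reversed(chosen))

print(select_units(["h", "ə", "l", "oʊ"]))
```

The dynamic program reflects the basic idea: each candidate is scored on how well it matches the target and how smoothly it joins its neighbor, and the cheapest overall sequence is the one that gets concatenated.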
In our speech-to-text post, we looked at the differences between how computers and brains process speech. In the human body, generating speech is different from processing it: production requires the lungs, voice box, and nasal and oral cavities to form meaningful, audible utterances that project to a listener, while the listener processing that speech relies on the structures of the inner ear to register the sound waves. Generating speech is similar to processing heard speech, however, in that it is governed by a small, specialized part of the brain. Broca’s area, located like Wernicke’s area on the left hemisphere of the brain but on the lower portion of the frontal lobe, is involved in speech production, control of the facial neurons, and, through its connection to Wernicke’s area, language processing. In addition, just as damage to Wernicke’s area produces its own aphasia, damage to Broca’s area causes Broca’s aphasia, a condition in which people have difficulty producing grammatical sentences with complex structure, both orally and in writing. Paul Broca discovered this in 1861 while treating a patient who could only say the word “tan”.
Broca’s area makes use of the different phonemes we have learned in order to guide our language production. Much like the unit selection algorithm used in speech synthesis technology, it determines the sounds we use to form the morphemes (the smallest meaningful units) of a spoken statement. However, methods such as concatenative sound synthesis and diphone selection create intelligible words only indirectly, since they rely on prerecorded sounds spoken by an actual human being. Articulatory speech synthesis is an alternative that directly simulates the principles of speech production, using a model built to emulate the movement of the vocal tract. In this way it simulates the physical sound sources that produce speech in the human body.
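A full articulatory simulation is well beyond a short example, but the source-filter idea it rests on can be sketched with a much simpler formant-style synthesizer: a pulse source standing in for the vocal folds, filtered by resonators placed at assumed formant frequencies for the vowel /a/. The sample rate, fundamental frequency, and formant values below are rough, illustrative choices, not measurements.

```python
# Simplified source-filter sketch: an impulse-train "glottal" source passed
# through a cascade of two-pole resonators approximating formants of /a/.
import numpy as np
from scipy.signal import lfilter

FS = 16000                                        # sample rate (Hz), assumed
F0 = 120.0                                        # fundamental frequency of the source (Hz)
FORMANTS = [(700, 80), (1220, 90), (2600, 120)]   # (center Hz, bandwidth Hz), approximate

def resonator_coeffs(freq, bw, fs=FS):
    # Standard two-pole resonator coefficients.
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [1 - r]                                   # rough gain normalization
    return b, a

def synthesize_vowel(duration=0.5):
    n = int(duration * FS)
    # Impulse train standing in for glottal pulses.
    source = np.zeros(n)
    source[::int(FS / F0)] = 1.0
    out = source
    for freq, bw in FORMANTS:                     # cascade the formant resonators
        b, a = resonator_coeffs(freq, bw)
        out = lfilter(b, a, out)
    return out / np.max(np.abs(out))              # normalize amplitude

vowel = synthesize_vowel()
```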
Synthetic speech currently assists communication for several different purposes, including reading aids for the blind and speaking aids for the deaf. The technology also has applications in education, where artificially produced speech is programmed for specific tasks that help teach spelling and pronunciation in different languages. Beyond general assistance, speech synthesis has also been put to use in the telecommunications industry, for services such as telephone inquiry systems.
One of the more recent uses of speech synthesis has been pairing it with speech-to-text recognition to create a type of human-computer interaction based entirely on speaking. The most widely used examples are the virtual assistants Siri, present in Apple products, and Cortana, found in Microsoft products. Both use concatenative synthesis to form the computer-human interaction, each drawing on many recordings of an individual person’s voice so that the software can form intelligible language. As this software improves and people become more reliant on interacting with their phones and computers by speaking, it will be interesting to see which voices are chosen to interact with users. Siri, while having different voices in different countries and now a male version, is voiced in American English by Susan Bennett, who is also the voice of Delta Airlines. The voice of Cortana, Jen Taylor, was not necessarily chosen for having a neutral voice suited to talking with many different people, but because she is the actor who voiced the character Cortana in the Halo video game series, from which the assistant derives her name.
With the unit selection of concatenative synthesis requiring only one voice to create a speech synthesis interface, there is no need for different voice actors for the same language. However, Apple offers different variations of English for Siri depending on the part of the world in which it is spoken, something that can be changed in an iPhone’s Settings. This shows that these companies recognize that a language spoken in different countries varies noticeably from place to place, and that people might want to hear a voice similar to their own.
However, any voice that is used will carry some regional variation and will never be exactly representative of an entire nation. Although a solution would be to include different regional dialects, this could lead to the marginalization of ways of speaking considered far removed from a country’s standard if the actors perform them in a comical way. For example, an actor performing a New York accent, even if it were his or her place of origin, might make particular phonetic choices that end up sounding cartoonish. The ultimate resolution would be software that includes every phoneme found throughout the world and can draw on the ones present in the region where it is being used. Software of that kind is unlikely for some time, so until then we will have to make do with voices that we can understand but that do not sound exactly like us.
Standards relating to speech and software ergonomics using voice input may be found on the ANSI Webstore.