Speech-to-Text and Phonetic Recognition

On September 1, 1878, Emma Nutt became the first female telephone operator, establishing a norm that lasted for over a century, until operators were largely displaced by technology that could recognize the human voice. Although speech recognition software did not affect the occupation for most of its history, the technology was in development for much of that time. The difficulty of designing software that recognizes human speech derives from the complexities that exist within all languages, and any computer system capable of speech recognition must be able to process language in a manner similar to the human brain. Speech-to-text technology is advancing rapidly, but it can only become a staple of human-computer interaction once it can comprehend variation in dialect at a level similar to or greater than that of humans.

The earliest notable speech recognition systems appeared in the 1950s and 1960s, and although they were innovative for their time, they could recognize only digits or a handful of words and were hardly capable of comprehending human language in its organic use. The first of these was the “Audrey” system, built by Bell Laboratories in 1952, which could recognize single digits spoken by one voice. It was followed ten years later by IBM’s “Shoebox” machine, which could understand 16 spoken words in English. Systems of this type improved further in the 1970s, mainly through the U.S. Department of Defense’s DARPA Speech Understanding Research (SUR) program, which led to Carnegie Mellon’s “Harpy” speech understanding system. Harpy’s biggest accomplishment was that it could understand 1011 words, a vocabulary considered equivalent to that of a three-year-old.

In the 1980s, the use of Hidden Markov Models (HMMs) greatly expanded the vocabulary of speech recognition systems, and, more importantly, opened the door to recognizing a potentially unlimited number of words. The HMM approach was innovative because, instead of storing specific words and their pronunciations, it estimated which sounds form words by calculating the probability of candidate utterances. Another significant component of the model is prediction, since HMMs can estimate which utterances are likely to come next to complete a thought. Despite its importance, a direct implementation of HMM principles would have poor accuracy and would struggle across different environments, so the model needed refinement to work properly. In the 1990s, HMM-based systems became automated and reached the public, during a period that also introduced the computerized phone operator. Even as it became more widespread, speech recognition was still limited by the availability of data, a limitation that Google addressed in the 2000s with cloud computing technology, allowing Google’s English Voice Search system to incorporate 230 billion words.
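The idea of scoring an utterance by probability can be illustrated with a minimal sketch in Python. It is not how any particular system named above was built. The two toy word models, the state names (onset, vowel, coda), the sound labels ("T", "AA", "AE", "P"), and every probability value are assumptions invented for illustration. The sketch uses the forward algorithm, a standard way of computing how likely a sequence of observed sounds is under an HMM, and then picks the word whose model gives the highest likelihood.

```python
def forward_likelihood(observations, start, trans, emit):
    """Return P(observations | model) using the forward algorithm.

    observations must be a non-empty list of sound labels.
    """
    states = list(start)
    # alpha[s]: probability of the sounds heard so far, ending in state s
    alpha = {s: start[s] * emit[s].get(observations[0], 0.0) for s in states}
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * trans[p].get(s, 0.0) for p in states)
            * emit[s].get(obs, 0.0)
            for s in states
        }
    return sum(alpha.values())


def make_word_model(vowel_emissions):
    """Build a hypothetical three-state model: onset -> vowel -> coda.

    Real acoustic models also allow a state to repeat; that is omitted here.
    """
    start = {"onset": 1.0, "vowel": 0.0, "coda": 0.0}
    trans = {
        "onset": {"vowel": 1.0},
        "vowel": {"coda": 1.0},
        "coda": {"coda": 1.0},
    }
    emit = {"onset": {"T": 1.0}, "vowel": vowel_emissions, "coda": {"P": 1.0}}
    return start, trans, emit


# Invented numbers: each word usually, but not always, has its expected vowel.
MODELS = {
    "top": make_word_model({"AA": 0.8, "AE": 0.2}),
    "tap": make_word_model({"AE": 0.8, "AA": 0.2}),
}


def recognize(observations):
    """Return the word whose model assigns the sounds the highest probability."""
    return max(MODELS, key=lambda w: forward_likelihood(observations, *MODELS[w]))


print(recognize(["T", "AE", "P"]))
```

With these invented numbers, the sounds T, AE, P score 0.8 under the "tap" model and 0.2 under the "top" model, so the sketch prints tap. Neither word is ruled out; the system simply prefers the more probable one.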

The ultimate goal of HMMs, which have long been a primary basis for speech recognition technology, is to recreate the way language is processed in the brain, or at least to create a procedure that replicates it well enough to comprehend a human’s natural speaking pattern. Whether prediction, one of the main components of an HMM, is an essential aspect of language processing is disputed among researchers, and according to psycholinguists Falk Huettig and Nivedita Mani, not all language users appear to predict language. The degree of prediction can differ greatly from person to person, even within the same speech community, so while speech recognition software relies on prediction to process words, prediction is not necessarily how the human brain accomplishes the same task.

The way an HMM calculates the probability of an unknown utterance draws on the structure of language, and on the way the human brain processes it, in a manner that works well for a computer. A phoneme is the smallest unit of sound in spoken language that can distinguish one word from another, and phonemes are present in any utterance a person can make. However, in terms of meaning, the morpheme is the more important unit, since it is the smallest meaningful unit of language and cannot be divided into smaller meaningful segments. Morphemes are divided into free and bound morphemes: free morphemes can carry meaning on their own, as base words do, while bound morphemes alter the meaning of base words as affixes. For example, the word kind is a single morpheme, but if the suffix -ly attaches to it, the resulting word kindly contains two morphemes. In addition, a bound morpheme can change the meaning of the original word entirely, as when the prefix un- is added to form unkind. These different meanings, and the way they combine to construct spoken language, are something that we are constantly processing as we listen to someone else speak.
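The division of a word into free and bound morphemes can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the lists of prefixes, suffixes, and base words below are tiny, hypothetical stand-ins for a real lexicon, and real morphological analysis has to handle spelling changes and ambiguity that this sketch ignores.

```python
# Hypothetical toy lexicon, just large enough for the examples in the text.
PREFIXES = ("un",)          # bound morphemes attached before the base
SUFFIXES = ("ly", "ness")   # bound morphemes attached after the base
BASES = {"kind"}            # free morphemes (base words)


def split_morphemes(word):
    """Split a word into prefixes, a known base word, and suffixes.

    Returns the word unsplit if no known base word can be uncovered.
    """
    before, after = [], []
    stem = word
    stripped = True
    while stem not in BASES and stripped:
        stripped = False
        for prefix in PREFIXES:
            if stem.startswith(prefix) and len(stem) > len(prefix):
                before.append(prefix + "-")
                stem = stem[len(prefix):]
                stripped = True
                break
        if stem in BASES:
            break
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) > len(suffix):
                after.insert(0, "-" + suffix)
                stem = stem[:-len(suffix)]
                stripped = True
                break
    if stem not in BASES:
        return [word]
    return before + [stem] + after


for example in ("kind", "kindly", "unkind", "unkindly"):
    print(example, split_morphemes(example))
```

Under these assumptions, kind comes back as one morpheme, kindly and unkind as two each, and unkindly as three: un-, kind, and -ly.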

While speech recognition technology understands language through computational approximation, we make use of one specific part of our brain: Wernicke’s area. This area is located in the temporal lobe on the left side of the brain and is responsible for mapping the speech of others onto meanings that we hold in memory. If this part of the brain is damaged, the result can be Wernicke’s aphasia, a condition in which people can still produce fluent speech, though it often lacks meaning, but have difficulty understanding the speech of others.

Speech recognition technology is likely to have trouble with accents that are foreign to the technology’s country of origin, even when the same language is spoken in both places. For example, speech recognition software created in the United States for English speakers could have trouble recognizing certain words containing the letter R as spoken by many British speakers, since they generally pronounce it only when it is followed by a vowel.
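The R pattern described above can be written as a simple rule over a list of sounds, as in the following Python sketch. The sound labels are loosely modeled on ARPAbet-style symbols, and the vowel set and example pronunciations are simplified assumptions chosen for illustration, not an accurate phonetic transcription of any accent. The rule also works on one word at a time, so it ignores an R that is pronounced because the next word begins with a vowel.

```python
# Assumed, incomplete set of vowel labels for this illustration.
VOWELS = {"AA", "AE", "AH", "AO", "EH", "IH", "IY", "OW", "UW"}


def drop_unpronounced_r(phonemes):
    """Keep an R only when the next sound in the word is a vowel."""
    result = []
    for i, sound in enumerate(phonemes):
        next_sound = phonemes[i + 1] if i + 1 < len(phonemes) else None
        if sound == "R" and next_sound not in VOWELS:
            continue  # R before a consonant or at the end of the word is dropped
        result.append(sound)
    return result


print(drop_unpronounced_r(["K", "AA", "R"]))        # "car"
print(drop_unpronounced_r(["HH", "AA", "R", "D"]))  # "hard"
print(drop_unpronounced_r(["K", "AE", "R", "IY"]))  # "carry"
```

In this sketch the R disappears from car and hard but stays in carry. A recognizer that expects every R to be pronounced would therefore be comparing its stored pronunciation against a shorter sequence of sounds.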

However, having a domestic accent does not guarantee that this technology can understand what the speaker is saying. The standardization of a country’s language is a very complex topic, but it fundamentally requires a process in which a particular form is selected and then reinforced by being taught in schools and other institutions. Other forms of the language come to be considered inferior to this elite form, and individuals who do not use it are viewed as speaking incorrectly. In the United States, the standardized form is known as Standard American English (SAE), a variety that has never actually been defined and does not really exist. This “unaccented” form of English is just an idealized form of the language, carrying no marker of the regional origin or socioeconomic background of the speaker. When it is taught in places such as schools, it often absorbs part of the regional accent, or the speaker’s own understanding of what SAE is, so that the standard form of English differs throughout the country.

People might find this kind of variation in American English cumbersome when speaking to someone of a different origin, even though all of the varieties of English spoken in the United States are mutually intelligible. Accents that are phonetically very different from the listener’s own can still be understood through the processing of the listener’s brain, something that could be challenging for speech recognition technology. This technology could favor Standard American English, or at least the form spoken by the company or person producing it, leading it to misunderstand certain words because of the distinctive pronunciation of phonemes in particular parts of the country. For example, in Great Lakes English, sometimes called the Chicago accent, the o in the word lot is fronted and unrounded, so the word top can sound like tap to outsiders. In conversation with someone who speaks this dialect, however, a listener can draw on a certain level of cultural understanding to know what the speaker is saying. While speech recognition technology could comprehend the statement from its context, it would likely recognize the individual word incorrectly. Comprehension of phonetic variation within a single language will need to improve in speech recognition so that people can speak naturally to the technology and use it widely as an input method. With the large amount of data that can be used, along with access to the vast body of knowledge on the internet, this is something that could well be perfected in the near future.