---

History of Speech Technology

 

Speech technology has gone through several phases of innovation, each one addressing the shortcomings of the previous generation. Many people remember using speech technology over the phone, for example to program an interactive video recording system remotely. These systems could recognize only a limited number of keywords, such as “Yes,” “No,” or the number “5.” If more than one word was spoken, the system needed a pause of silence between words to tell them apart. Unfortunately, conversational speech does not naturally contain these pauses.

The next evolution in speech technology was phonetic indexing. This was a fast way of finding matches because it looked only for base phonemes and specific groupings of phonemes. Unfortunately, this approach produced many false positives, because a search term can appear inside another word, such as “cat” in “catastrophe.” It was also sensitive to background noise, the bandwidth of the call and, particularly, accents, because the phonemes produced can differ considerably between speakers.

A language model was developed to address the issues presented by phonetic indexing. This technique used a dictionary and a pre-defined language model, achieved a highly accurate recognition rate, and could also find phrases. However, the language model could not distinguish between homophones (e.g. "eye" and "I") or heteronyms (e.g. the two pronunciations of "bass" in "The bass player ate bass"), and it was confined to the pre-set dictionary. To make the approach more flexible, a self-learning language model was developed that could learn new words. This improved model was highly accurate and could be set up without a dictionary, but it required massive computational resources.

Today, the latest generation of speech technology delivers conceptual search. This approach uses advanced mathematics and complex algorithms to derive meaning from speech. Conceptual search addresses the shortcomings of previous speech technology models and provides the most accurate way of recognizing and finding speech because it understands what is being said. It can distinguish between homophones and heteronyms, and it can find and group content by concept. It can also find related information based on meaning, and it has lower computational requirements than some earlier generations of speech recognition technology.

 




Phoneme Processing

Phonemes are the smallest discrete sound units of a language and form the basic components of any word. Phoneme matching attempts to break words down into their constituent phonemes and then match search terms to combinations of phonemes as they occur in the audio stream. While this approach does not require a dictionary, it is limited in accuracy and cannot make conceptual matches.

Phoneme processing is a commonly used approach to audio recognition, but it is frequently inaccurate and often returns high levels of false positives. Because words are treated simply as combinations of sounds with no awareness of their meaning in context, the system cannot differentiate between the required data, homophones, and phrases that share the same phonemes but bear no conceptual relation to the search terms. For example, the sentence “The computer can recognize speech” contains many of the same basic phoneme components as “The oil spill will wreck a nice beach,” while the meaning is entirely different. Phoneme processing also cannot account for multiple expressions of the same concept, so any information that is related to the search term but does not contain the same phonemes will not be returned.
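The false-positive behavior described above can be illustrated with a minimal sketch. It assumes a tiny, hand-written pronunciation lexicon using approximate ARPAbet-style symbols; the lexicon, the function names, and the transcriptions are illustrative assumptions and do not represent any particular product's implementation. The search simply looks for the query's phonemes as a contiguous run in the phoneme stream, with no knowledge of word boundaries or meaning.

```python
# Toy pronunciation lexicon (hypothetical; approximate ARPAbet-style symbols).
LEXICON = {
    "the": ["DH", "AH"],
    "computer": ["K", "AH", "M", "P", "Y", "UW", "T", "ER"],
    "can": ["K", "AE", "N"],
    "recognize": ["R", "EH", "K", "AH", "G", "N", "AY", "Z"],
    "speech": ["S", "P", "IY", "CH"],
    "wreck": ["R", "EH", "K"],
    "beach": ["B", "IY", "CH"],
}


def to_phonemes(text):
    """Flatten a sentence into one phoneme stream, discarding word boundaries."""
    stream = []
    for word in text.lower().split():
        stream.extend(LEXICON[word])
    return stream


def find_phonemes(query, stream):
    """Return every start index where `query` occurs contiguously in `stream`."""
    n = len(query)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == query]


# Stand-in for the phoneme stream decoded from an audio recording.
audio = to_phonemes("The computer can recognize speech")

# "wreck" was never spoken, but its phonemes open the word "recognize".
hits_wreck = find_phonemes(LEXICON["wreck"], audio)

# "beach" shares only its last two phonemes with "speech", so it is not found.
hits_beach = find_phonemes(LEXICON["beach"], audio)

print("wreck:", hits_wreck)
print("beach:", hits_beach)
```

In this sketch the search for "wreck" reports a hit inside "recognize" even though the word was never spoken, which is the kind of false positive that purely phonetic matching cannot rule out without context.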




Word Spotting

As with phoneme matching, word spotting techniques search for words out of context, so they are unable to differentiate between homophones and homonyms. Because the system relies on exact sound matches, it is also unable to account for variations, such as plurals, that change the sound of a word but not the concept behind it. As with other purely phonetic approaches, word spotting cannot make conceptual associations and will frequently miss related information that is not included in the search terms.
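The plural problem can be sketched in a few lines. The example assumes the audio has already been segmented into words, each represented by a hand-written, approximate ARPAbet-style phoneme sequence; the data and the `spot` function are hypothetical and serve only to show how an exact whole-word sound match behaves.

```python
# Hypothetical word-segmented audio: one (label, phonemes) pair per spoken word.
UTTERANCE = [
    ("close", ["K", "L", "OW", "Z"]),
    ("both", ["B", "OW", "TH"]),
    ("accounts", ["AH", "K", "AW", "N", "T", "S"]),
]

# Sound templates for the keywords being spotted (also hypothetical).
ACCOUNT = ["AH", "K", "AW", "N", "T"]
ACCOUNTS = ["AH", "K", "AW", "N", "T", "S"]


def spot(template, utterance):
    """Return positions of words whose phonemes match the template exactly."""
    return [i for i, (_, phones) in enumerate(utterance) if phones == template]


singular_hits = spot(ACCOUNT, UTTERANCE)   # the plural form is missed
plural_hits = spot(ACCOUNTS, UTTERANCE)    # only the exact form is found

print("account:", singular_hits)
print("accounts:", plural_hits)
```

A search for "account" finds nothing here because the speaker said "accounts", even though both forms express the same concept; each variant would have to be listed as a separate search term.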




Natural Language Processing

Natural Language Processing (NLP) is a form of human-to-computer interaction in which the elements of human language, whether spoken or written, are formalized so that a computer can perform value-adding tasks based on that interaction. Autonomy’s approach differs from standard NLP use in that it is still able to harness the power of IDOL’s conceptual analysis. Autonomy’s NLP technology functions independently of linguistic constraints, giving Autonomy’s software universal application possibilities anywhere in the world.

Related Documents

Autonomy Audio Broadcast White Paper