How NaturallySpeaking Works
Dragon NaturallySpeaking is a cutting-edge speech recognition application that reliably delivers highly accurate recognition over a large-scale vocabulary. Learning about the methods NaturallySpeaking uses will help you to understand the application better and to obtain better results with it.
NaturallySpeaking’s Goal - Increasing Recognition Accuracy
The primary goal of NaturallySpeaking, and by extension all forms of speech recognition software, is to increase user productivity by providing a more immediate and effective means of interacting with the computer than typing. To succeed in this goal, speech recognition software must be able to identify an incoming stream of sounds as words dictated by the user. Recognition accuracy is the measure of the software's ability to identify these sounds as words correctly.
Whilst occasional recognition errors are normal in all speech recognition systems (as they are in human conversation), frequent errors or repeated "base" errors (such as not recognising an important name or common technical term) diminish both absolute and perceived recognition accuracy. This, in turn, makes users less productive and less satisfied with the software.
NaturallySpeaking’s large standard vocabulary, and NaturallySpeaking Professional’s additional specialised vocabularies, increase the chance of a name or term being recognised correctly the first time it’s dictated, allowing users to achieve higher levels of productivity and satisfaction with the product.
In trying to identify an incoming stream of sounds as specific words, NaturallySpeaking relies on two main sources of information:
• An acoustic model that catalogues how each word sounds, enabling NaturallySpeaking to choose the words that sound most similar to the spoken utterance, together with a profile of the user’s voice, which is constantly updated to increase recognition accuracy.
• A vocabulary and associated language model that provide statistical information about the probability of specific words, and sequences of those words, occurring in the utterance. When there is ambiguity between similar-sounding words, the language model helps to identify the most likely word.
These two information sources work together to turn captured utterances into accurately and reliably recognised text.
NaturallySpeaking and the Acoustic Model
The way you speak is totally distinctive, and no-one on earth sounds exactly the same way as you do. Dragon NaturallySpeaking relies on this individuality to create a unique mathematical model of your voice's sound patterns.
NaturallySpeaking analyses each sound you make and compares it to a database of over 10,000 possible syllables in the English language. As it becomes more familiar with your speech patterns (a process greatly enhanced by training the application when creating a new user profile), it becomes more accurate in identifying individual sounds. For example, the way you pronounce a “th” sound changes how Dragon NaturallySpeaking responds to any word with that sound in its pronunciation.
As the acoustic model recognises sounds, it’s the vocabulary’s task to relate those sounds to actual words.
NaturallySpeaking and vocabularies
A vocabulary in Dragon NaturallySpeaking is compiled from a body of information that typically includes a word list and a language model. When the vocabulary is compiled, the word list adds words to Dragon NaturallySpeaking’s active vocabulary (which is loaded into RAM and allows instant recognition) and to its backup dictionary (which holds an expanded number of words for correction purposes), improving the language model and recognition accuracy. The language model contains usage and context information about all the words.
Dragon NaturallySpeaking therefore uses a vocabulary to recognise words correctly based not only on the sounds of the words, but also on the context of those words within your current document.
Furthermore, every vocabulary is associated with a specific user. This enables every vocabulary to be personalised with language information that applies only to that user, allowing the user to add terms, phrases and words not included in Dragon NaturallySpeaking’s default vocabularies. In addition, a user can have several vocabularies to suit different tasks (e.g. general use, work use, social use).
All words in the vocabulary have an initial set of pronunciations. The acoustic model uses these pronunciations to decide which words most closely match what was spoken. A word may have more than one pronunciation assigned to it, such as the word "either", which may be pronounced "EE-ther" or "EYE-ther"; and in turn a pronunciation may have more than one word assigned to it, such as the words "to", "too" and "two". In the latter case, Dragon NaturallySpeaking’s language model assesses the context of the word within the sentence to determine which word is most likely to be correct.
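The many-to-many relationship between words and pronunciations can be sketched as a pair of lookup tables. This is only an illustration: the names `PRONUNCIATIONS`, `build_reverse_index` and `candidates_for` are hypothetical, and the pronunciation labels are informal spellings rather than whatever internal representation Dragon NaturallySpeaking actually uses.

```python
from collections import defaultdict

# Hypothetical word list: each word maps to one or more pronunciations.
# The labels are informal spellings, not a real phoneme set.
PRONUNCIATIONS = {
    "either": ["EE-ther", "EYE-ther"],  # one word, two pronunciations
    "to": ["too"],                      # three words sharing
    "too": ["too"],                     # a single pronunciation
    "two": ["too"],
}


def build_reverse_index(pronunciations):
    """Map each pronunciation to every word that can be spoken that way."""
    index = defaultdict(list)
    for word, prons in pronunciations.items():
        for pron in prons:
            index[pron].append(word)
    return dict(index)


REVERSE_INDEX = build_reverse_index(PRONUNCIATIONS)


def candidates_for(pronunciation):
    """Return the words that match a pronunciation (empty list if none)."""
    return REVERSE_INDEX.get(pronunciation, [])
```

Looking up `candidates_for("too")` returns three candidate words, which is exactly the situation where sound alone cannot decide and context has to.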
Text representation of the word
The text representation of a word is the string of characters that Dragon NaturallySpeaking displays when the word is recognised. In addition to the lower-case letters of the alphabet, a text representation can contain upper-case letters, spaces and punctuation. For example, a vocabulary might contain the following words, each shown with the text that is displayed (written form) and the way it is dictated (spoken form):
| Written Form | Spoken Form |
|---|---|
| IV | four |
| Newcastle | new castle |
| Ph.D. | P. H. D. |
This level of accurate representation allows users to add new terms, phrases and acronyms to their vocabulary, and determine how Dragon NaturallySpeaking recognises them.
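As a rough sketch of the written form and spoken form pairing in the table above, the entries can be held in a small mapping and looked up by what was said. The helper names `normalise` and `written_form` are hypothetical, and the lookup is deliberately simplistic: it always returns the table entry, whereas a real recogniser would also weigh context (for instance, deciding between "IV" and "four").

```python
# Entries taken from the table above: written form -> spoken form.
WORD_FORMS = {
    "IV": "four",
    "Newcastle": "new castle",
    "Ph.D.": "P. H. D.",
}


def normalise(spoken):
    """Lower-case the spoken form and collapse runs of whitespace."""
    return " ".join(spoken.lower().split())


# Reverse lookup: normalised spoken form -> text to display.
SPOKEN_TO_WRITTEN = {normalise(s): w for w, s in WORD_FORMS.items()}


def written_form(spoken):
    """Return the display text for a spoken form, or the input if unknown."""
    return SPOKEN_TO_WRITTEN.get(normalise(spoken), spoken)
```

With these entries, `written_form("new castle")` gives `Newcastle`, and an unlisted phrase is passed through unchanged.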
The language model
In addition to a word list, a vocabulary has a language model containing statistical information that predicts which words are most likely to occur in the context of the user's speech. This information includes, but is not limited to:
• The unigram probability of each word, that is, the likelihood of this word being used in text compared with other words in the same vocabulary. For example, if the verb "write" is more likely to occur in text compared with the name "Wright," then "write" will have a higher unigram probability.
• The bigram/trigram/quadgram probability, that is, the likelihood of two-word, three-word and four-word sequences occurring in text. For example, if the bigram "Mr. Wright" is more likely than "Mr. write," the language model should favour "Mr. Wright" even though "write" has a higher unigram probability than "Wright." In this context, the bigram/trigram probability outweighs the unigram probability.
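To make the two kinds of probability concrete, here is a minimal sketch that estimates them by simple counting over a tiny made-up corpus. The corpus, the function names and the use of raw relative frequencies without smoothing are all assumptions for illustration; the source does not describe how Dragon NaturallySpeaking computes its statistics.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only.
CORPUS = [
    "please write to Mr. Wright",
    "I will write a letter",
    "Mr. Wright will write back",
]

sentences = [line.split() for line in CORPUS]

# Count single words and adjacent word pairs (pairs never span sentences).
unigram_counts = Counter(tok for sent in sentences for tok in sent)
bigram_counts = Counter(
    pair for sent in sentences for pair in zip(sent, sent[1:])
)
total_tokens = sum(unigram_counts.values())


def unigram_prob(word):
    """Relative frequency of a word among all tokens in the corpus."""
    return unigram_counts[word] / total_tokens


def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]
```

In this toy corpus "write" is more frequent than "Wright" overall, yet after "Mr." only "Wright" ever occurs, so the bigram estimate points the other way from the unigram estimate.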
Language model "slots"
Every vocabulary has three "slots" for storing language model information (although not all vocabularies contain information in each slot).
• The base slot stores the base language model that ships with Dragon NaturallySpeaking.
• The middle slot may contain a custom language model based on a significant amount of data developed for a target group of users (e.g. specialised legal or medical applications). Only Dragon NaturallySpeaking Professional can use vocabularies that have a language model in the middle slot.
• The user slot may contain a user language model based on a relatively small amount of data, normally used by a single user.
How Dragon NaturallySpeaking uses language models
As a user dictates, Dragon NaturallySpeaking uses the statistical data in the current vocabulary to predict the word that most likely matches what the user actually said. This information includes both the unigram probabilities of individual words and the bigram/trigram/quadgram probability for sequences of words.
For example, the language model favours the word pair "world affairs" over "whirled affairs" because "world" occurs more frequently than "whirled", giving it a higher unigram probability; and because "world affairs" occurs more frequently than "whirled affairs" in the speech and writing of target users, resulting in a higher bigram probability. Notice that the two-word sequences sound exactly alike, so the acoustic model can do nothing to differentiate them; the language model calculation is the only information that can help make the right choice.
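The "world affairs" case can be sketched as a comparison of two hypotheses whose acoustic scores tie, so that only the language model separates them. Every number below is invented for illustration, and adding log probabilities is a common textbook way to combine scores rather than a documented detail of Dragon NaturallySpeaking.

```python
import math

# Hypothetical acoustic scores: identical, because the phrases sound alike.
ACOUSTIC_LOGPROB = {
    ("world", "affairs"): math.log(0.5),
    ("whirled", "affairs"): math.log(0.5),
}

# Hypothetical language model figures (invented for illustration).
UNIGRAM = {"world": 0.002, "whirled": 0.00001}
# P(second word | first word)
BIGRAM = {("world", "affairs"): 0.01, ("whirled", "affairs"): 0.000001}


def language_logprob(first, second):
    """log P(first) + log P(second | first) for a two-word hypothesis."""
    return math.log(UNIGRAM[first]) + math.log(BIGRAM[(first, second)])


def total_score(hypothesis):
    """Combine the acoustic and language model scores in log space."""
    return ACOUSTIC_LOGPROB[hypothesis] + language_logprob(*hypothesis)


def best_hypothesis(hypotheses):
    """Pick the hypothesis with the highest combined score."""
    return max(hypotheses, key=total_score)
```

Because the acoustic terms are equal, `best_hypothesis` is decided entirely by the unigram and bigram figures, mirroring the reasoning in the paragraph above.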
If the current vocabulary has no information in the user slot or middle slot, Dragon NaturallySpeaking makes its prediction based on the language model in the base slot. If the current vocabulary has information in the user slot, middle slot, or both, Dragon NaturallySpeaking makes its predictions by combining the statistical information of each slot, according to the following principle: the better a slot is over time at predicting the correct word, the more weight that slot will be given for future predictions.
Note that this principle ensures that over time, regardless of changes made to the user and middle slots, predictions will generally be as good as or better than those based solely on the base slot.
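One way to picture the weighting principle is a weighted average of the slots' probabilities, with each weight nudged towards the slots that gave the confirmed word a high probability. This is purely an assumed scheme for illustration: the source states the principle but not the formula, and the functions `combine` and `update_weights` and the `rate` parameter are hypothetical.

```python
def combine(slot_probs, weights):
    """Weighted average of each slot's probability for one candidate word."""
    return sum(weights[slot] * prob for slot, prob in slot_probs.items())


def update_weights(weights, probs_for_correct_word, rate=0.1):
    """Shift weight towards slots that predicted the correct word well.

    probs_for_correct_word holds the probability each slot assigned to the
    word the user actually said. The weights keep summing to 1.
    """
    # Each slot's share of the credit for the correct word.
    credit = {s: weights[s] * probs_for_correct_word[s] for s in weights}
    total = sum(credit.values())
    if total == 0:
        return dict(weights)  # no slot predicted the word; leave unchanged
    return {
        s: (1 - rate) * weights[s] + rate * credit[s] / total
        for s in weights
    }


# Usage: the user slot keeps predicting the correct word better than base.
weights = {"base": 0.5, "user": 0.5}
for _ in range(20):
    weights = update_weights(weights, {"base": 0.1, "user": 0.6})
```

Under this scheme a slot that predicts poorly loses weight rather than dragging the result down, which is consistent with the claim that combined predictions should not end up worse than the base slot alone.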
The total package
It’s this combination of processes that has made Dragon NaturallySpeaking the most accurate speech recognition program available today. Now that you know how NaturallySpeaking recognises your speech, you can exploit this process, for example by tailoring your vocabularies and creating specialised user profiles for different tasks, to gain the maximum possible benefit from the application.