Speech Processing Concepts
Speech Processing Concepts for the Speech Recognition API

This page explains some of the audio and speech processing concepts used by the Speech Recognition API, and how to get the best results from the API.

For a good-quality audio file, the Speech Recognition API can usually return a good text transcription. However, if your results are not as good as you expect, this page can provide some information about factors that might help you improve them.

Understand Audio

To get the best results from audio processing in Haven OnDemand, it helps to understand some of the properties of audio.

Most humans cannot detect subtle differences in audio quality. The human brain is excellent at understanding meaning, and tends to ignore defects in audio if it can pick out the meaning. Because automatic speech processing technologies use statistical methods, varying attributes and imperfections in the audio alter the patterns computed in the analysis, which can lead to poor results.

The key to ensuring good results is to avoid imperfections and distortions in the audio.

Audio Sources and Capture

You can acquire audio for processing from many sources. For best results, the audio file should be captured cleanly, be free of clipping and background noise, use minimal compression, and be sampled at a rate appropriate for the language option.

The following sections provide more detail about each aspect of audio quality.

Audio Capture

The positioning of the microphone during audio capture has significant influence on audio quality:

  • A microphone positioned very close to the mouth, such as a headset microphone, picks up speech very clearly. However, it also picks up any other noises made in the oral cavity, such as lip smacking.

  • A microphone placed a couple of feet away from the speaker picks up speech that has arrived at the microphone from multiple reflected paths, in addition to the direct path.

The quality of these two signals is substantially different, leading to varying Speech Recognition success rates.


Clipping

Clipping is a form of waveform distortion that occurs when the amplification of a signal is too high, producing sample values that exceed the range representable in 16 bits. Clipping has a severe impact on the frequency properties of the audio signal.

To identify clipping in an audio file, inspect it with a waveform display tool; clipped regions appear as flattened peaks at the maximum amplitude.

There is no reliable method for recovering the original signal from clipped audio, so for Speech Recognition purposes you must prevent clipping at capture time.
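
As a rough illustration, clipping can be flagged programmatically by looking for runs of consecutive samples pinned at the 16-bit extremes. This is a minimal sketch, not part of the API; the function name and run-length threshold are illustrative choices:

```python
def count_clipped_runs(samples, run_length=3):
    """Count runs of consecutive samples pinned at the 16-bit extremes.

    Isolated peak samples can occur naturally; sustained runs at the
    maximum amplitude are the hallmark of clipping.
    """
    LIMIT = 32767  # 16-bit PCM sample range is -32768..32767
    runs = 0
    run = 0
    for s in samples:
        if s >= LIMIT or s <= -LIMIT - 1:
            run += 1
        else:
            if run >= run_length:
                runs += 1
            run = 0
    if run >= run_length:  # a clipped run may end the file
        runs += 1
    return runs
```

A file that reports many clipped runs should be re-recorded at a lower gain rather than submitted for recognition.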

Background Noise

Background noise, including music, can significantly impact speech processing.

The signal-to-noise ratio (SNR) measures the level of the speech signal relative to the background noise; a higher SNR means cleaner audio. Recordings with noticeable background noise can have an SNR as low as 10-15 dB. Good-quality recordings have an SNR above 25 dB.
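
The SNR computation itself is straightforward. The sketch below assumes you can separate signal and noise segments in order to measure their average power; the helper names are illustrative:

```python
import math

def mean_power(samples):
    """Mean power (average squared amplitude) of a sample sequence."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)
```

For example, a signal whose power is 100 times the noise power has an SNR of 20 dB, below the 25 dB threshold for good-quality recordings.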

Audio Compression

Media files are often stored in compressed form to save space. Audio compression usually introduces a small amount of distortion. The compression algorithms used in codecs (short for coder-decoder) have been refined over many years to minimize distortion from a perceptual point of view. In general, as the compression rate increases, so does the distortion, but newer codecs generally introduce less distortion than older codecs at the same compression rate.

For a list of codecs that the Speech Recognition API supports, see Supported Media Formats.

Audio Sampling Rate

The sampling rate is the rate at which an audio signal is sampled and digitized. In general, the higher the sampling rate, the more information is preserved in the audio signal. This table lists commonly used sampling rates.

Sampling rate (Hz)   Use
8,000                Telephone; adequate for human speech but without sibilance ('ess' sounds like 'eff'; /s/ is confusable with /f/).
11,025               One quarter of the audio CD sampling rate. Used for lower-quality PCM and MPEG audio, and for audio analysis of subwoofer bandpasses.
16,000               Wideband frequency extension over standard telephone narrowband (8,000 Hz). Used in most modern VoIP and VVoIP communication products.
22,050               One half of the audio CD sampling rate. Used for lower-quality PCM and MPEG audio.
32,000               MiniDV digital video camcorders, video tapes with extra audio channels (for example, DVCAM with four channels), DAT (LP mode), and NICAM digital audio used alongside analog television sound in some countries. Suitable for digitizing FM radio.
44,056               Used by digital audio locked to NTSC color video signals (245 lines x 3 samples x 59.94 fields per second; 59.94 fields per second corresponds to 29.97 frames per second).
44,100               Audio CD; also the most common rate for MPEG-1 audio (VCD, SVCD, MP3). Much professional audio equipment uses (or can select) 44,100 Hz, including mixers, EQs, compressors, reverb units, crossovers, and recording devices.

The Speech Recognition API accepts audio files with a range of sampling rates. Haven OnDemand recommends that the sampling rate be at least the rate required by the language option: 8 kHz for processing telephony audio and 16 kHz for processing broadcast audio. Audio with a sampling frequency below the required minimum is upsampled, which causes severe quality issues.
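
As a sanity check before you submit a file, you can read the sampling rate from a WAV header with Python's standard wave module. This is a minimal sketch; the helper names are illustrative, and the minimums follow the guidance above:

```python
import io
import wave

def wav_sample_rate(fileobj):
    """Read the sampling rate from a WAV header without decoding audio."""
    with wave.open(fileobj, "rb") as w:
        return w.getframerate()

def meets_minimum(rate_hz, minimum_hz):
    """True if the file will not be upsampled by the service."""
    return rate_hz >= minimum_hz

# Build an in-memory 8 kHz mono 16-bit WAV to demonstrate the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 80)
buf.seek(0)
rate = wav_sample_rate(buf)
```

An 8 kHz file passes the telephony minimum but would be upsampled by a broadcast (16 kHz) language option.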

The audio bandwidth can restrict whether you can sample an audio file at 8 kHz or 16 kHz. For more information, see "Audio Bandwidth" below.

Audio Bandwidth

The bandwidth is a property of the audio signal and represents the frequency up to which the signal holds information. The bandwidth of a signal is often equal to, but never higher than, half the sampling rate (this principle is known as the Nyquist theorem). For example, an audio stream with a sampling rate of 32 kHz has a maximum bandwidth of 16 kHz. The bandwidth can also be lower than half the sampling rate.

You need to find the bandwidth of an audio stream to decide what sampling rate to choose.

In the Speech Recognition options, several languages have two language options. The telephony options can process audio at 8 kHz, and the broadband options can process audio at 16 kHz.

Audio streams sampled at 16 kHz contain more information than streams sampled at 8 kHz. Therefore, Haven OnDemand recommends that you choose the higher sampling rate where possible. However, an 8 kHz (telephony) language option expects the audio bandwidth to be close to 4 kHz, whereas the 16 kHz (broadband) language option expects the audio bandwidth to be close to 8 kHz.

If you discover that an audio stream has a bandwidth much lower than expected for the 16 kHz option, Haven OnDemand recommends that you downsample it to 8 kHz before you send it to the API. The bandwidth of an audio signal can be lower than half the sampling rate for many reasons, including low pass filtering and upsampling (increasing the sampling rate).
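
As a rough sketch of the 2:1 downsampling described above: average adjacent sample pairs (a crude low-pass) before discarding every other sample. A real pipeline should use a proper anti-aliasing filter, for example via a tool such as SoX or FFmpeg:

```python
def downsample_by_two(samples):
    """Crude 2:1 decimation: average adjacent pairs (a simple low-pass)
    before halving the rate. Illustrative only; production resampling
    should apply a proper anti-aliasing filter."""
    return [(samples[i] + samples[i + 1]) / 2.0
            for i in range(0, len(samples) - 1, 2)]
```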

You can use waveform analysis tools, such as WaveSurfer (http://sourceforge.net/projects/wavesurfer/) and Adobe Audition, to check the bandwidth of an audio stream.
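
You can also estimate bandwidth programmatically by finding the frequency below which most of the spectral energy lies. The sketch below uses a direct DFT, so it is only practical for short windows; the 99% energy threshold is an illustrative choice:

```python
import cmath
import math

def effective_bandwidth(samples, sample_rate, energy_fraction=0.99):
    """Estimate the frequency (Hz) below which `energy_fraction` of the
    spectral energy lies. Direct DFT: keep the window short."""
    n = len(samples)
    energies = []
    for k in range(n // 2):  # bins up to the Nyquist limit
        coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        energies.append(abs(coeff) ** 2)
    total = sum(energies)
    cumulative = 0.0
    for k, e in enumerate(energies):
        cumulative += e
        if cumulative >= energy_fraction * total:
            return k * sample_rate / n
    return (n // 2) * sample_rate / n
```

For a pure 1 kHz tone sampled at 16 kHz, the estimate comes out near 1 kHz, far below the 8 kHz expected by a broadband language option, so the stream would be a candidate for downsampling to 8 kHz.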

Speech Recognition

The Speech Recognition API converts speech to text by breaking speech down into successive layers of information, an approach used by most leading speech technologists:

  • Sentences are composed of words (the basis for the language models).
  • Words are composed of phonemes and allophones (for detailed definitions of both terms, see http://www.sil.org/linguistics/glossaryOfLinguisticTerms). This is the basis for the pronunciation dictionaries.
  • Each phoneme and allophone is described in terms of frequency spectrum-based features (the basis for the acoustic models).
  • Signal processing analysis (performed by the front-end algorithms) converts an audio signal into frequency spectrum-based features.

To use this approach, you must specify a language option in the Speech Recognition API, which determines the language pack to use. The language pack contains:

  • A language model, which contains information about how sentences are composed of words, as well as the word pronunciation dictionary (lexicon).
  • An acoustic model, which describes feature patterns for a complete set of allophones used by the particular natural language.
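
The word-to-phoneme layer can be illustrated with a toy lexicon. Everything here is illustrative: the phoneme symbols and helper function are not the language pack's actual format.

```python
# Toy pronunciation lexicon: each word maps to a phoneme sequence, in the
# way a language pack's pronunciation dictionary does.
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "text": ["t", "eh", "k", "s", "t"],
}

def phonemes_for(sentence):
    """Break a sentence into words, then words into phoneme sequences."""
    return [LEXICON[w] for w in sentence.lower().split() if w in LEXICON]
```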

The following diagram shows the inputs and resources that the speech-to-text engine receives.

Audio Quality Guidelines

This section describes the audio properties required for accurate speech processing.

Several factors affect the recall rate (correct detection) of words or phrases:

  • Signal bandwidth
  • Background noise
  • Speech clarity, which can be affected by factors such as the accent and fluency of the speaker
  • Audio signal distortion, due to compression and storage
  • Breadth of language context

For best speech processing results, ensure that your audio conforms to the following guidelines:

  • The sampling frequency must be at least the frequency required by the language option: 8 kHz for telephony audio and 16 kHz for broadcast audio.

    Audio files with sampling frequencies below this minimum are upsampled, which causes severe quality issues.

  • The minimum SNR (signal-to-noise ratio) is 15 dB. An SNR of 25 dB or above produces the best results. This ratio is measured across the word or phrase being detected and not across the entire audio.

  • Words or phrases must be articulated reasonably clearly, and must largely conform to the language.

  • Speech-to-text performance is known to be poorer for non-native speakers than for native speakers.

  • Natural speech rates produce the best speech-to-text results. Speech that is faster or slower than natural usually produces more errors.

  • Every recognized word is associated with an acoustic confidence value. Generally, false positives tend to have a lower acoustic confidence compared to true hits.

  • Newer audio codecs offer less distortion for the same rate of compression.

  • If the language context of the content is too broad, the effectiveness of the language model is reduced.
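
The confidence guideline above suggests filtering recognized words by a threshold. A minimal sketch, assuming results arrive as (word, confidence) pairs; the pair format and default threshold are illustrative, not the API's response schema:

```python
def filter_by_confidence(words, threshold=0.5):
    """Keep recognized words whose acoustic confidence meets the
    threshold; false positives tend to score lower than true hits.
    `words` is a list of (word, confidence) pairs."""
    return [(w, c) for (w, c) in words if c >= threshold]
```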