Speech Recognition

Transcribes speech to text from a video or audio file.

The Speech Recognition API creates a transcript of the speech in an audio or video file. You can then use this output with other Haven OnDemand APIs, such as Concept Extraction or Add to Text Index, to gain further insight and analysis.

The Speech Recognition API currently supports broadcast-quality content in several languages, as well as telephony-grade audio for some of those languages. For a list of the available languages, see the language enumeration under Enumeration Types. Check back soon for additional languages.

For a list of supported video and audio file formats, see Supported Media Formats.

Quick Start

You can upload a video to the API as a file, in which case you must use a POST method.

curl -X POST http://api.havenondemand.com/1/api/async/recognizespeech/v1 --form "apikey=YOUR_API_KEY" --form "file=@hpnext.mp4"
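
The same upload can be sketched in Python with only the standard library. This is a minimal sketch rather than an official client: build_upload_request is a hypothetical helper, the apikey field is taken from the Authentication section, and the function only constructs the request so that you can inspect it before sending it.

```python
import uuid
import urllib.request

API_URL = "https://api.havenondemand.com/1/api/async/recognizespeech/v1"


def build_upload_request(filename, file_bytes, fields):
    """Build a multipart/form-data POST request (hypothetical helper).

    `fields` holds the plain form fields, for example apikey or interval.
    The media content is sent in the `file` field, as in the curl example.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'
    ).encode("utf-8")
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send the request, read the media file in binary mode, pass its bytes to build_upload_request, and hand the result to urllib.request.urlopen.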

You can also input a URL or a Haven OnDemand reference.

Note: Due to the runtime of this API, it is available only as an asynchronous version. See Get the Results.

Note: This API has rate and duration limits:

  • Input files are truncated after 30 minutes.
  • The processing terminates after two hours and returns only what it has completed within that time.

For more information, see Rate Limiting, Quotas, Data Expiry, and Maximums.

The default options assume broadcast-quality American English, and the API provides a single, non-segmented transcription of the entire file. For example:

  
        "document": [
          {
            "content": "we want to hear from you let's get the conversation started about what's next for Hewlett Packard this is HP next this matters"
          }
        ]
  

If you provide a URL, it must link directly to an audio or video file. You cannot link to a page with an embedded video (such as a news page or YouTube link).

/1/api/async/recognizespeech/v1?url=https://www.havenondemand.com/sample-content/videos/hpnext.mp4
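
Because the media URL is itself a query parameter value, it should be percent-encoded when you build the request. The following is a minimal sketch of that step; build_url_job is a hypothetical helper, and the apikey parameter comes from the Authentication section.

```python
from urllib.parse import urlencode

API_URL = "https://api.havenondemand.com/1/api/async/recognizespeech/v1"


def build_url_job(media_url, apikey, **options):
    """Return a request URL that asks the API to fetch media_url itself.

    Extra keyword arguments (for example interval or language) are added
    as additional query parameters.
    """
    params = {"url": media_url, "apikey": apikey}
    params.update(options)
    # urlencode percent-encodes the media URL so it survives as one value.
    return f"{API_URL}?{urlencode(params)}"
```

For example, build_url_job("https://www.havenondemand.com/sample-content/videos/hpnext.mp4", "YOUR_KEY", interval=10000) produces a URL that can be requested directly.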

Specify the Language

If you have content in a language or dialect other than American English, you can specify the language parameter.

curl -X POST http://api.havenondemand.com/1/api/async/recognizespeech/v1 --form "apikey=YOUR_API_KEY" --form "language=en-US" --form "file=@hpnext.mp4"

The Speech Recognition API provides Broadband and Telephony language options. Each language model is trained on a large body of representative data. The Broadband language options are trained on many hours of broadcast-quality content, such as TV news programs, while the Telephony language options are trained on many hours of voice calls.

For the highest accuracy, use the option and model that most closely resemble your voice data. For example, if you are processing voice mails recorded over a telephone network, use the Telephony language option, if available. If you have a reasonable-quality recording, such as a webinar recorded with a good-quality microphone, use the Broadband language option.

For more information about how to pick the appropriate language option to get the best results for your data, see Speech Processing Concepts.
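
The choice between Broadband and Telephony options can be sketched as a small lookup. The language codes below are copied from the language enumeration in this document. The pick_language helper is hypothetical, and falling back to the Broadband option when no Telephony option exists is an assumption of this sketch, not documented API behavior.

```python
# Language codes copied from the language enumeration in this document.
SUPPORTED_LANGUAGES = {
    "ar-MSA", "de-DE", "en-AU", "en-AU-tel", "en-CA", "en-CA-tel",
    "en-US", "en-US-tel", "en-GB", "en-GB-tel", "en-SG", "en-ZA-tel",
    "es-ES", "es-ES-tel", "es-LA", "es-LA-tel", "fa-IR", "fr-CA",
    "fr-CA-tel", "fr-FR", "fr-FR-tel", "it-IT", "ja-JP", "ja-JP-tel",
    "nl-NL", "nl-NL-tel", "pl-PL", "pl-PL-tel", "pt-BR", "ro-RO",
    "ro-RO-tel", "ru-RU", "zh-CN",
}


def pick_language(locale, telephony=False):
    """Return the language value for a locale such as "en-US".

    If telephony is True and a Telephony option exists, it is returned.
    Otherwise the Broadband option is returned (an assumed fallback).
    """
    tel_code = f"{locale}-tel"
    if telephony and tel_code in SUPPORTED_LANGUAGES:
        return tel_code
    if locale in SUPPORTED_LANGUAGES:
        return locale
    raise ValueError(f"no suitable language option for {locale!r}")
```

With this sketch, pick_language("en-US", telephony=True) selects en-US-tel, while pick_language("de-DE", telephony=True) falls back to de-DE because the enumeration lists no Telephony German option.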

Segment the Output

The Speech Recognition API also allows you to segment the transcript either on each word or on a set time basis. The interval parameter takes a time in milliseconds. It also has the following special values:

  • 0. Segment on each word.
  • -1. Turn off segmentation.

The following example segments the transcript into 10-second intervals:

curl -X POST http://api.havenondemand.com/1/api/async/recognizespeech/v1 --form "apikey=YOUR_API_KEY" --form "interval=10000" --form "file=@hpnext.mp4"

 
        "document": [
          {
            "offset": 912,
            "content": "we want to hear from you let's get the conversation started about what's next for"
          },
          {
            "offset": 5092,
            "content": "Hewlett Packard this is HP next this matters"
          }
        ]
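
Because each offset is given in milliseconds from the start of the audio, segmented output like the example above is easy to turn into timestamped lines. The helper names in this minimal sketch are hypothetical; the sample data is the response fragment shown above.

```python
def format_offset(offset_ms):
    """Convert a millisecond offset to an MM:SS.mmm timestamp."""
    minutes, remainder = divmod(offset_ms, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def format_segments(document):
    """Prefix each segment's content with its start timestamp."""
    return [f"[{format_offset(seg['offset'])}] {seg['content']}" for seg in document]


segments = [
    {"offset": 912, "content": "we want to hear from you let's get the "
                               "conversation started about what's next for"},
    {"offset": 5092, "content": "Hewlett Packard this is HP next this matters"},
]

for line in format_segments(segments):
    print(line)
```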
      

If you set interval to 0, the API also returns a confidence value and duration for each word in the output. For example:


        "document": [
          {
            "offset": 840,
            "content": "we",
            "confidence": 97,
            "duration": 110
          },
          {
            "offset": 950,
            "content": "want",
            "confidence": 70,
            "duration": 190
          },
          {
            "offset": 1140,
            "content": "to",
            "confidence": 78,
            "duration": 60
          },
          {
            "offset": 1200,
            "content": "hear",
            "confidence": 90,
            "duration": 210
          }
...
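
Word-level output makes it possible to flag words that the recognizer was unsure about. The following minimal sketch uses the four words from the example above. The helper names are hypothetical, and the threshold of 80 is an arbitrary illustration that assumes confidence is reported on a 0 to 100 scale, as the sample values suggest.

```python
words = [
    {"offset": 840, "content": "we", "confidence": 97, "duration": 110},
    {"offset": 950, "content": "want", "confidence": 70, "duration": 190},
    {"offset": 1140, "content": "to", "confidence": 78, "duration": 60},
    {"offset": 1200, "content": "hear", "confidence": 90, "duration": 210},
]


def transcript(document):
    """Join word-level segments back into a single string."""
    return " ".join(word["content"] for word in document)


def low_confidence_words(document, threshold):
    """Return the words whose confidence is below the threshold."""
    return [w["content"] for w in document if w["confidence"] < threshold]


def word_end(word):
    """Return the end time of a word (offset plus duration)."""
    return word["offset"] + word["duration"]


print(transcript(words))
print(low_confidence_words(words, 80))
```

In this sample, each word's offset plus its duration equals the offset of the next word, so word_end can also be used to detect pauses between words in longer transcripts.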

Get the Results

The asynchronous mode returns a job-id, which you can then use to extract your results. There are two methods for this:

  • Use /1/job/status/ to get the status of the job, including results if the job is finished.
  • Use /1/job/result/, which waits until the job has finished and then returns the result.

    Note: Because /result has to wait for the job to finish before it can return a response, using it for longer operations such as processing a large video file can result in an HTTP request timeout response. The /result method returns a response either when the result is available, or after 120 seconds, whichever is sooner. If the job is not complete after 120 seconds, the /result method returns a code 7010 (job result request timeout) response. This means that your asynchronous job is still in progress. To avoid the timeout, use /status instead.
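
Polling /1/job/status/ can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the top-level "status" field with the values "finished" and "failed" is an assumption about the status response, which this page does not describe. Check an actual response before relying on it.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://api.havenondemand.com"


def fetch_status(job_id, apikey):
    """Request /1/job/status/<job-id> and return the parsed JSON body."""
    url = f"{BASE_URL}/1/job/status/{job_id}?{urlencode({'apikey': apikey})}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_job(job_id, apikey, fetch=fetch_status, sleep=time.sleep,
                 poll_seconds=5, max_polls=120):
    """Poll the job status until it finishes, fails, or max_polls is reached.

    ASSUMPTION: the status response has a top-level "status" field whose
    value is "finished" or "failed" once the job is done.
    """
    for _ in range(max_polls):
        status = fetch(job_id, apikey)
        state = status.get("status")
        if state == "finished":
            return status
        if state == "failed":
            raise RuntimeError(f"job {job_id} failed: {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The fetch and sleep arguments are injectable so that the loop can be exercised without network access, for example by passing a function that returns canned status dictionaries.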

Optimize Results

The quality of the audio file that you send can have a large effect on the quality of the speech recognition output. For example, the location of the microphone, background noise, and audio compression can all have an effect on how well the Speech Recognition API detects the words in a particular audio file. For more information on how to get the best results from this API, see Speech Processing Concepts.

Asynchronous
https://api.havenondemand.com/1/api/async/recognizespeech/v1

This API supports only asynchronous invocation.

Authentication

This API requires an authentication token to be supplied in the following parameter:

Parameter   Description
apikey      The API key to use to authenticate the API request.

Parameters

This API accepts the following parameters:

Required

Supply one of the following inputs:

Name        Type     Description
file        binary   A video or audio file containing the speech to process. Multipart POST only.
reference   string   A Haven OnDemand object store reference obtained from either the Expand Container or Store Object API. The corresponding file is passed to the API.
url         string   A publicly accessible HTTP URL from which a video or audio file can be retrieved.

Optional

Name        Type     Description
interval    number   The time interval to use to segment the speech in the output. Use -1 to turn off segmentation, 0 to segment on every word, and a positive number for a time interval in milliseconds. Default value: -1.
language    enum     The language of the provided speech. For the highest accuracy, use the option and model that most closely resemble your voice data. For example, to process voice mails recorded over a telephone network, use the Telephony language option, if available. If you have a reasonable-quality recording, such as a webinar recorded with a good-quality microphone, use the Broadband language option. Default value: en-US.

Enumeration Types

This API's parameters use the enumerations described below:

language
The language of the provided speech. For the highest accuracy, use the option and model that most closely resembles your voice data. For example, to process voice mails recorded over a telephone network use the Telephony language option, if available. If you have a reasonable quality recording such as a webinar recorded with a good quality microphone, use the Broadband language.

Value       Option      Language
ar-MSA      Broadband   Modern Standard Arabic
de-DE       Broadband   German
en-AU       Broadband   Australian English
en-AU-tel   Telephony   Australian English
en-CA       Broadband   Canadian English
en-CA-tel   Telephony   Canadian English
en-US       Broadband   US English
en-US-tel   Telephony   US English
en-GB       Broadband   British English
en-GB-tel   Telephony   British English
en-SG       Broadband   Singapore English
en-ZA-tel   Telephony   South African English
es-ES       Broadband   European Spanish
es-ES-tel   Telephony   European Spanish
es-LA       Broadband   Latin American Spanish
es-LA-tel   Telephony   Latin American Spanish
fa-IR       Broadband   Farsi (Persian)
fr-CA       Broadband   Canadian French
fr-CA-tel   Telephony   Canadian French
fr-FR       Broadband   French
fr-FR-tel   Telephony   French
it-IT       Broadband   Italian
ja-JP       Broadband   Japanese
ja-JP-tel   Telephony   Japanese
nl-NL       Broadband   Dutch
nl-NL-tel   Telephony   Dutch
pl-PL       Broadband   Polish
pl-PL-tel   Telephony   Polish
pt-BR       Broadband   Brazilian Portuguese
ro-RO       Broadband   Romanian
ro-RO-tel   Telephony   Romanian
ru-RU       Broadband   Russian
zh-CN       Broadband   Mandarin

This API returns a JSON response that is described by the model below. This single model is presented both as an easy-to-read abstract definition and as the formal JSON schema.

Asynchronous Use

Because this API is always invoked asynchronously, additional requests are required to get the result.

You can use /1/job/status/<job-id> to get the status of the job, including results if the job is finished.

You can also use /1/job/result/<job-id>, which waits until the job has finished and then returns the result.

Model
This is an abstract definition of the response that describes each of the properties that might be returned.
Speech Recognition Response {
    document (array[Document]) The speech block transformed to text.
}
Speech Recognition Response:Document {
    content (string) The extracted block of text from speech.
    offset (integer, optional) The position of the first word in this content section (in milliseconds from the start of the audio). This value is returned only when you set interval to segment the audio.
    confidence (integer, optional) Confidence that this word is correct. This value is returned only when you set interval to 0 (segment on every word).
    duration (integer, optional) The duration of the word in this content section (in milliseconds). This value is returned only when you set interval to 0 (segment on every word).
}
Model Schema
This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "properties": {
        "document": {
            "items": {
                "properties": {
                    "content": {
                        "type": "string"
                    },
                    "offset": {
                        "type": "integer"
                    },
                    "confidence": {
                        "type": "integer"
                    },
                    "duration": {
                        "type": "integer"
                    }
                },
                "required": [
                    "content"
                ],
                "type": "object"
            },
            "type": "array"
        }
    },
    "required": [
        "document"
    ],
    "type": "object"
}
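
A response can be checked against this model without a JSON Schema library. The following minimal sketch hand-codes the same rules as the schema above (a required document array whose items have a required string content and optional integer fields). The validate_response helper is hypothetical.

```python
OPTIONAL_INTEGER_FIELDS = ("offset", "confidence", "duration")


def validate_response(response):
    """Return response unchanged, or raise ValueError if it breaks the model."""
    if not isinstance(response, dict) or not isinstance(response.get("document"), list):
        raise ValueError("response must be an object with a 'document' array")
    for index, item in enumerate(response["document"]):
        if not isinstance(item, dict) or not isinstance(item.get("content"), str):
            raise ValueError(f"document[{index}] needs a string 'content'")
        for field in OPTIONAL_INTEGER_FIELDS:
            if field not in item:
                continue
            value = item[field]
            # bool is a subclass of int in Python but is not a JSON integer.
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"document[{index}].{field} must be an integer")
    return response
```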

To try this API with your own data and use it in your own applications, you need an API Key. You can create an API Key from your account page - API Keys.
