Speech Recognition

Extracts a text transcript from target media.

The Speech Recognition API creates a transcript of the speech in an audio or video file. You can then use this output with other Haven OnDemand APIs, such as Concept Extraction or Add to Text Index, to gain further insight and analysis.

The Speech Recognition API currently supports broadcast-quality content in several languages, as well as telephony-grade audio for some of those languages.

For a list of supported video and audio file formats, see Supported Media Formats.

Quick Start

You can upload the audio or video to the API as a file, in which case you must use a POST method.

curl -X POST https://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "language_model=en-US" --form "file=@hpnext.mp4"

You can also input a URL or a Haven OnDemand reference.
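
For example, to pass a Haven OnDemand object store reference instead of a file (a minimal sketch; <your-storage-reference> is a placeholder for a reference obtained from the Store Object or Expand Container API):

curl -X POST https://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "language_model=en-US" --form "reference=<your-storage-reference>"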

Note: Because input files for this API can be large and take a long time to process, the API runs only in asynchronous mode. See Get the Results.

Note: This API has rate and duration limits:

  • Input files are truncated after 30 minutes.
  • The processing terminates after two hours and returns only what it has completed within that time.

For more information, see Rate Limiting, Quotas, Data Expiry, and Maximums.

The API provides a segmented transcription of the entire file. It also returns a start and end time offset and a confidence value for each word in the output.

For example, our sample file hpnext.mp4 returns the following (output truncated to the first few items):


"items": [{
                "start_time_offset": 0.84,
                "end_time_offset": 0.95,
                "text": "we",
                "confidence": 97
}, {
                "start_time_offset": 0.95,
                "end_time_offset": 1.14,
                "text": "want",
                "confidence": 70
}, {
                "start_time_offset": 1.14,
                "end_time_offset": 1.2,
                "text": "to",
                "confidence": 78
}, {
                "start_time_offset": 1.2,
                "end_time_offset": 1.41,
                "text": "hear",
                "confidence": 90
}, {
                "start_time_offset": 1.41,
                "end_time_offset": 1.61,
                "text": "from",
                "confidence": 95
}, {
                "start_time_offset": 1.61,
                "end_time_offset": 1.93,
                "text": "you",
                "confidence": 86
}, {
                "start_time_offset": 2.45,
                "end_time_offset": 2.67,
                "text": "let's",
                "confidence": 91
}]

If you provide a URL, it must link directly to an audio or video file. You cannot link to a page with an embedded video (such as a news page or YouTube link).

/1/api/async/recognizespeech/v2?url=https://www.havenondemand.com/sample-content/videos/hpnext.mp4&language_model=en-US

Specify the Language

You must specify the language model for the recognition engine to use via the language_model parameter.

curl -X POST https://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "language_model=en-US" --form "file=@hpnext.mp4"

The Speech Recognition API provides Broadband and Telephony language options. Each language model is trained on a large body of representative data: the Broadband options are trained on many hours of broadcast-quality content, such as TV news programs, while the Telephony options are trained on many hours of voice calls.

For the highest accuracy, use the option and model that most closely resemble your voice data. For example, if you are processing voice mails recorded over a telephone network, use the Telephony language option, if available. If you have a reasonable-quality recording, such as a webinar recorded with a good-quality microphone, use the Broadband language option.

For more information about how to pick the appropriate language option to get the best results for your data, see Language Models.

Get the Results

The asynchronous mode returns a job ID, which you can then use to retrieve your results. There are two methods for this:

  • Use /1/job/status/ to get the status of the job, including results if the job is finished.
  • Use /1/job/result/, which waits until the job has finished and then returns the result.

    Note: Because /result has to wait for the job to finish before it can return a response, using it for longer operations such as processing a large video file can result in an HTTP request timeout response. The /result method returns a response either when the result is available, or after 120 seconds, whichever is sooner. If the job is not complete after 120 seconds, the /result method returns a code 7010 (job result request timeout) response. This means that your asynchronous job is still in progress. To avoid the timeout, use /status instead.
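
For example, you can check the job status with a single request (a minimal sketch; it assumes the shell variables JOB_ID and APIKEY hold the job ID returned by the submit call and your API key):

curl "https://api.havenondemand.com/1/job/status/$JOB_ID?apikey=$APIKEY"

Repeat this request until the response reports that the job has finished, and then read the transcription from the results included in the status response.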

Optimize Results

The quality of the audio file that you send can have a large effect on the quality of the speech recognition output. For example, the location of the microphone, background noise, and audio compression can all have an effect on how well the Speech Recognition API detects the words in a particular audio file. For more information on how to get the best results from this API, see Speech Processing Concepts.

Asynchronous
https://api.havenondemand.com/1/api/async/recognizespeech/v2

This API supports only asynchronous invocation.

Authentication

This API requires an authentication token to be supplied in the following parameter:

Parameter   Description
apikey      The API key to use to authenticate the API request.
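
For example, you can supply the key as an additional form parameter on the asynchronous POST request (a sketch; replace <your-api-key> with a key from your account):

curl -X POST https://api.havenondemand.com/1/api/async/recognizespeech/v2 --form "apikey=<your-api-key>" --form "language_model=en-US" --form "file=@hpnext.mp4"
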
Parameters

This API accepts the following parameters:

Required

Name             Type      Description
file             binary    A media file containing the speech to transcribe. Multipart POST only.
reference        string    A Haven OnDemand object store reference obtained from either the Expand Container or Store Object API. The corresponding video is passed to the API.
url              string    A publicly accessible HTTP URL from which a video or audio file can be retrieved.
language_model   resource  The language of the provided speech. For the highest accuracy, use the option and model that most closely resemble your voice data. For example, to process voice mails recorded over a telephone network, use the Telephony language option, if available. If you have a reasonable-quality recording, such as a webinar recorded with a good-quality microphone, use the Broadband language option.

This API returns a JSON response that is described by the model below. This single model is presented both as an easy-to-read abstract definition and as the formal JSON schema.

Asynchronous Use

Additional requests are required to get the result if this API is invoked asynchronously.

You can use /1/job/status/<job-id> to get the status of the job, including results if the job is finished.

You can also use /1/job/result/<job-id>, which waits until the job has finished and then returns the result.
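
For example (a minimal sketch, again assuming JOB_ID and APIKEY hold your job ID and API key):

curl "https://api.havenondemand.com/1/job/result/$JOB_ID?apikey=$APIKEY"

Remember that this request waits for the job to finish, so for long-running jobs it can return the code 7010 timeout response described above; in that case, poll /1/job/status/<job-id> instead.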

Model
This is an abstract definition of the response that describes each of the properties that might be returned.
Speech Recognition Response {
source_information ( Source_information , optional) Metadata about the media file.
items ( array[Items] ) The speech transcription results.
}
Speech Recognition Response:Source_information {
mime_type ( string ) MIME type of the document.
video_information ( Video_information , optional) Information about the video track if one is present.
audio_information ( Audio_information , optional) Information about the audio track if one is present.
}
Speech Recognition Response:Source_information:Video_information {
width ( integer ) The width of the video in pixels.
height ( integer ) The height of the video in pixels.
codec ( string ) The algorithm used to encode the video.
pixel_aspect_ratio ( string ) The aspect ratio of pixels in the video. For example, if the video is made up of square pixels this value is 1:1.
}
Speech Recognition Response:Source_information:Audio_information {
codec ( string ) The algorithm used to encode the audio.
sample_rate ( integer ) The frequency at which the audio was sampled.
channels ( integer ) The number of channels present in the audio. For example, for stereo this value is 2.
}
Speech Recognition Response:Items {
start_time_offset ( number ) Time from the start of the media to where the word starts. This value is expressed as a decimal number.
end_time_offset ( number ) Time from the start of the media to where the word ends. This value is expressed as a decimal number.
text ( string ) The word(s) being spoken at the specified time.
confidence ( number ) A confidence value (0-100) for the transcription of this word.
}
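
Because each item contains the word (or words) recognized at a given time, you can reconstruct a plain transcript by concatenating the text values in the items array. A minimal sketch using the jq command-line JSON processor (jq is not part of the API; response.json is assumed to contain the response body described above):

# Join all recognized words into a single transcript string.
jq -r '[.items[].text] | join(" ")' response.json

# Keep only words with a confidence of at least 70.
jq -r '[.items[] | select(.confidence >= 70) | .text] | join(" ")' response.json
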
Model Schema
This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "properties": {
        "source_information": {
            "properties": {
                "mime_type": {
                    "type": "string"
                },
                "video_information": {
                    "properties": {
                        "width": {
                            "type": "integer",
                            "minimum": 1
                        },
                        "height": {
                            "type": "integer",
                            "minimum": 1
                        },
                        "codec": {
                            "type": "string"
                        },
                        "pixel_aspect_ratio": {
                            "type": "string"
                        }
                    },
                    "type": "object",
                    "required": [
                        "width",
                        "height",
                        "codec",
                        "pixel_aspect_ratio"
                    ]
                },
                "audio_information": {
                    "properties": {
                        "codec": {
                            "type": "string"
                        },
                        "sample_rate": {
                            "type": "integer"
                        },
                        "channels": {
                            "type": "integer"
                        }
                    },
                    "type": "object",
                    "required": [
                        "codec",
                        "sample_rate",
                        "channels"
                    ]
                }
            },
            "required": [
                "mime_type"
            ],
            "type": "object"
        },
        "items": {
            "items": {
                "properties": {
                    "start_time_offset": {
                        "type": "number"
                    },
                    "end_time_offset": {
                        "type": "number"
                    },
                    "text": {
                        "type": "string"
                    },
                    "confidence": {
                        "type": "number"
                    }
                },
                "required": [
                    "start_time_offset",
                    "end_time_offset",
                    "text",
                    "confidence"
                ],
                "type": "object"
            },
            "type": "array"
        }
    },
    "required": [
        "items"
    ],
    "type": "object"
}

To try this API with your own data and use it in your own applications, you need an API key. You can create an API key from the API Keys section of your account page.
