Language Identification

Identifies the language of a piece of text.

The Language Identification API analyzes a piece of text that you provide and returns the language of the text.

You can use Language Identification to determine the correct language settings to use for other Haven OnDemand APIs, such as Sentiment Analysis or Entity Extraction.

Quick Start

You must provide input text. The following example passes the text directly in the text parameter:

/1/api/[async|sync]/identifylanguage/v1?text=the+quick+brown+fox+jumps+over+the+lazy+dog

The API returns the language, the encoding, and the Unicode scripts (character ranges) that the input text includes.

{
  "language": "english",
  "language_iso639_2b": "ENG",
  "encoding": "UTF8",
  "unicode_scripts": [
    "Basic Latin"
  ]
}
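
As a rough illustration, the request above could be assembled in Python as follows. This is a minimal sketch rather than an official client: the helper name build_identify_url is hypothetical, and it assumes that the API key is sent as the apikey query parameter alongside text.

```python
from urllib.parse import urlencode

# Base URL taken from the endpoint list in this document.
BASE_URL = "https://api.havenondemand.com"


def build_identify_url(text, apikey, mode="sync"):
    """Return a GET URL for the Language Identification API.

    `mode` is "sync" or "async", matching the [async|sync] path segment.
    Sending `apikey` as a query parameter is an assumption of this sketch.
    """
    if mode not in ("sync", "async"):
        raise ValueError("mode must be 'sync' or 'async'")
    # urlencode encodes spaces as '+', as in the example request above.
    query = urlencode({"text": text, "apikey": apikey})
    return f"{BASE_URL}/1/api/{mode}/identifylanguage/v1?{query}"


url = build_identify_url("the quick brown fox jumps over the lazy dog", "YOUR_API_KEY")
print(url)
```

The resulting URL can be fetched with any HTTP client. YOUR_API_KEY is a placeholder.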

You can also provide the text in a file. In File mode, Haven OnDemand uses the Text Extraction API to extract the text from the file and then passes the extracted text to the Language Identification API.

Note: API input is subject to a maximum size quota. If you upload text or a file that is too large, the API returns an error. For more information, see Rate Limiting, Quotas, Data Expiry, and Maximums.

You must provide a minimum of three words for language identification, but providing more text improves accuracy. The amount of text needed for accurate identification depends on both the language and the type of text. For UTF-8 encoded text in a language that uses a unique script, the Language Identification API might be able to identify the language from only a few characters. For other languages, the API might need a few sentences to identify the language accurately, and it might need a large paragraph to distinguish between two similar languages.

The amount of text required also depends on the type of text. For example, it is difficult to identify the language from a list of places, numbers, and names. If your text consists largely of such items, you might need to provide more text to identify the language. For natural language text, such as a news article, the API can usually detect the language from fewer characters.
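
A client might check the three-word minimum before sending a request. The sketch below is one possible pre-check. The function name has_minimum_words is hypothetical, and counting whitespace-separated tokens is an assumption, since this document does not define how the API counts words.

```python
MIN_WORDS = 3  # minimum stated in this document


def has_minimum_words(text, minimum=MIN_WORDS):
    """Return True if `text` has at least `minimum` whitespace-separated tokens.

    Whitespace splitting is an assumption of this sketch. Scripts such as
    Chinese or Japanese do not separate words with spaces, so this check
    is not meaningful for them.
    """
    return len(text.split()) >= minimum


print(has_minimum_words("bonjour le monde"))  # True
print(has_minimum_words("hello world"))       # False
```

Passing this check only means that the request meets the stated minimum. It says nothing about whether the text is long enough for an accurate result.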

A full list of supported languages appears in the Language enum of the response model (see Model).

Synchronous
https://api.havenondemand.com/1/api/sync/identifylanguage/v1
Asynchronous
https://api.havenondemand.com/1/api/async/identifylanguage/v1
Authentication

This API requires an authentication token to be supplied in the following parameter:

Parameter Description
apikey The API key to use to authenticate the API request.
Parameters

This API accepts the following parameters:

Required
Name Type Description
file binary A file containing the document to process. Multipart POST only.
reference string A Haven OnDemand reference obtained from either the Expand Container or Store Object API. The corresponding document is passed to the API.
text string The text to process. You must provide a minimum of three words.
url string A publicly accessible HTTP URL from which the document can be retrieved.
Optional
Name Type Description
additional_metadata boolean Set to true to get additional metadata information on the identified language. Default value: false.
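
The parameters above could be collected into a request dictionary along the following lines. This is a sketch under stated assumptions: the helper build_params is hypothetical, it assumes that a request carries exactly one input parameter, and it assumes that the boolean is sent as the string true or false. The file parameter is left out because it requires a multipart POST.

```python
# Input parameters that can be sent as simple key/value pairs.
# `file` is excluded: this document says it is multipart POST only.
SIMPLE_INPUTS = ("reference", "text", "url")


def build_params(apikey, source, value, additional_metadata=False):
    """Return request parameters for one input source.

    Assumptions of this sketch: one input source per request, and
    booleans serialized as the lowercase strings "true" / "false".
    """
    if source not in SIMPLE_INPUTS:
        raise ValueError(f"source must be one of {SIMPLE_INPUTS}")
    return {
        "apikey": apikey,
        source: value,
        "additional_metadata": "true" if additional_metadata else "false",
    }


params = build_params("YOUR_API_KEY", "url", "http://example.com/page.html")
print(params)
```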

This API returns a JSON response that is described by the model below. The model is presented both as an easy-to-read abstract definition and as a formal JSON schema.

Asynchronous Use

Additional requests are required to get the result if this API is invoked asynchronously.

You can use /1/job/status/<job-id> to get the status of the job, including results if the job is finished.

You can also use /1/job/result/<job-id>, which waits until the job has finished and then returns the result.
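
The two job endpoints could be used from Python roughly as shown below. This is a sketch, not the official client. The function names are hypothetical, fetch_json is a caller-supplied function that performs the authenticated HTTP request and returns the decoded JSON body, and the status field with its queued and in progress values is an assumption that this document does not confirm.

```python
import time

BASE_URL = "https://api.havenondemand.com"


def job_status_url(job_id):
    """URL that reports the job status, including results once finished."""
    return f"{BASE_URL}/1/job/status/{job_id}"


def job_result_url(job_id):
    """URL that waits until the job has finished and then returns the result."""
    return f"{BASE_URL}/1/job/result/{job_id}"


def wait_for_job(job_id, fetch_json, interval=1.0, max_polls=30):
    """Poll the status endpoint until the job is no longer pending.

    The "status" field and the pending values "queued" and "in progress"
    are assumptions of this sketch.
    """
    for _ in range(max_polls):
        body = fetch_json(job_status_url(job_id))
        if body.get("status") not in ("queued", "in progress"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Polling the status endpoint suits clients that must stay responsive. A single call to the result endpoint is simpler when blocking is acceptable.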

Model
This is an abstract definition of the response that describes each of the properties that might be returned.
Language Identification Response {
encoding ( enum<Encoding> ) The identified encoding of the input text.
language ( enum<Language> ) The identified language of the input text.
language_iso639_2b ( enum<Language_iso639_2b> ) The ISO 639-2/B code for the identified language of the input text, or "UND" if the language could not be identified.
unicode_scripts ( array[string] , optional) The Unicode scripts (character ranges) that your input text includes.
}
enum<Language Identification Response:Encoding> {
'ARABIC' , 'ARABIC_ISO' , 'ASCII' , 'CHINESESIMPLIFIED' , 'CHINESETRADITIONAL' , 'CYRILLIC' , 'CYRILLIC_ISO' , 'CYRILLIC_KOI8' , 'EASTERNEUROPEAN' , 'EASTERNEUROPEAN_ISO' , 'EUC' , 'GREEK' , 'GREEK_ISO' , 'HEBREW' , 'HEBREW_ISO' , 'JIS' , 'KOREAN' , 'NORTHERNEUROPEAN' , 'NORTHERNEUROPEAN_ISO' , 'SHIFTJIS' , 'THAI' , 'TURKISH' , 'UTF8' , 'VIETNAMESE'
}
enum<Language Identification Response:Language> {
'afrikaans' , 'albanian' , 'amharic' , 'arabic' , 'armenian' , 'azeri' , 'basque' , 'belorussian' , 'bengali' , 'berber' , 'breton' , 'bulgarian' , 'burmese' , 'catalan' , 'cherokee' , 'chinese' , 'croatian' , 'czech' , 'danish' , 'dutch' , 'english' , 'esperanto' , 'estonian' , 'faroese' , 'finnish' , 'french' , 'gaelic' , 'georgian' , 'german' , 'greek' , 'greenlandic' , 'gujarati' , 'hebrew' , 'hindi' , 'hungarian' , 'icelandic' , 'indonesian' , 'isan' , 'italian' , 'japanese' , 'kannada' , 'kazakh' , 'khmer' , 'korean' , 'kurdish' , 'latin' , 'latvian' , 'lithuanian' , 'luxembourgish' , 'macedonian' , 'malayalam' , 'maltese' , 'maori' , 'mongolian' , 'nepali' , 'norwegian' , 'oriya' , 'pashto' , 'persian' , 'polish' , 'portuguese' , 'romanian' , 'russian' , 'serbian' , 'sindhi' , 'singhalese' , 'slovak' , 'slovenian' , 'somali' , 'spanish' , 'swahili' , 'swedish' , 'syriac' , 'tagalog' , 'tajik' , 'tamil' , 'telugu' , 'thai' , 'tibetan' , 'turkish' , 'ukrainian' , 'urdu' , 'uyghur' , 'uzbek' , 'vietnamese' , 'welsh' , 'unknown'
}
enum<Language Identification Response:Language_iso639_2b> {
'AFR' , 'ALB' , 'AMH' , 'ARA' , 'ARM' , 'AZE' , 'BAQ' , 'BEL' , 'BEN' , 'BER' , 'BRE' , 'BUL' , 'BUR' , 'CAT' , 'CHR' , 'CHI' , 'HRV' , 'CZE' , 'DAN' , 'DUT' , 'ENG' , 'EPO' , 'EST' , 'FAO' , 'FIN' , 'FRE' , 'GLE' , 'GEO' , 'GER' , 'GRE' , 'KAL' , 'GUJ' , 'HEB' , 'HIN' , 'HUN' , 'ICE' , 'IND' , 'ITA' , 'JPN' , 'KAN' , 'KAZ' , 'KHM' , 'KOR' , 'KUR' , 'LAO' , 'LAT' , 'LAV' , 'LIT' , 'LTZ' , 'MAC' , 'MAL' , 'MLT' , 'MAO' , 'MON' , 'NEP' , 'NPI' , 'NOR' , 'ORI' , 'PER' , 'POL' , 'POR' , 'PUS' , 'RUM' , 'RUS' , 'SRP' , 'SND' , 'SIN' , 'SLO' , 'SLV' , 'SOM' , 'SPA' , 'SWA' , 'SWE' , 'SYR' , 'TGL' , 'TGK' , 'TAM' , 'TEL' , 'THA' , 'TIB' , 'TUR' , 'UKR' , 'URD' , 'UIG' , 'UZB' , 'VIE' , 'WEL' , 'UND'
}
Model Schema
This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "properties": {
        "encoding": {
            "enum": [
                "ARABIC",
                "ARABIC_ISO",
                "ASCII",
                "CHINESESIMPLIFIED",
                "CHINESETRADITIONAL",
                "CYRILLIC",
                "CYRILLIC_ISO",
                "CYRILLIC_KOI8",
                "EASTERNEUROPEAN",
                "EASTERNEUROPEAN_ISO",
                "EUC",
                "GREEK",
                "GREEK_ISO",
                "HEBREW",
                "HEBREW_ISO",
                "JIS",
                "KOREAN",
                "NORTHERNEUROPEAN",
                "NORTHERNEUROPEAN_ISO",
                "SHIFTJIS",
                "THAI",
                "TURKISH",
                "UTF8",
                "VIETNAMESE"
            ]
        },
        "language": {
            "enum": [
                "afrikaans",
                "albanian",
                "amharic",
                "arabic",
                "armenian",
                "azeri",
                "basque",
                "belorussian",
                "bengali",
                "berber",
                "breton",
                "bulgarian",
                "burmese",
                "catalan",
                "cherokee",
                "chinese",
                "croatian",
                "czech",
                "danish",
                "dutch",
                "english",
                "esperanto",
                "estonian",
                "faroese",
                "finnish",
                "french",
                "gaelic",
                "georgian",
                "german",
                "greek",
                "greenlandic",
                "gujarati",
                "hebrew",
                "hindi",
                "hungarian",
                "icelandic",
                "indonesian",
                "isan",
                "italian",
                "japanese",
                "kannada",
                "kazakh",
                "khmer",
                "korean",
                "kurdish",
                "latin",
                "latvian",
                "lithuanian",
                "luxembourgish",
                "macedonian",
                "malayalam",
                "maltese",
                "maori",
                "mongolian",
                "nepali",
                "norwegian",
                "oriya",
                "pashto",
                "persian",
                "polish",
                "portuguese",
                "romanian",
                "russian",
                "serbian",
                "sindhi",
                "singhalese",
                "slovak",
                "slovenian",
                "somali",
                "spanish",
                "swahili",
                "swedish",
                "syriac",
                "tagalog",
                "tajik",
                "tamil",
                "telugu",
                "thai",
                "tibetan",
                "turkish",
                "ukrainian",
                "urdu",
                "uyghur",
                "uzbek",
                "vietnamese",
                "welsh",
                "unknown"
            ]
        },
        "language_iso639_2b": {
            "enum": [
                "AFR",
                "ALB",
                "AMH",
                "ARA",
                "ARM",
                "AZE",
                "BAQ",
                "BEL",
                "BEN",
                "BER",
                "BRE",
                "BUL",
                "BUR",
                "CAT",
                "CHR",
                "CHI",
                "HRV",
                "CZE",
                "DAN",
                "DUT",
                "ENG",
                "EPO",
                "EST",
                "FAO",
                "FIN",
                "FRE",
                "GLE",
                "GEO",
                "GER",
                "GRE",
                "KAL",
                "GUJ",
                "HEB",
                "HIN",
                "HUN",
                "ICE",
                "IND",
                "ITA",
                "JPN",
                "KAN",
                "KAZ",
                "KHM",
                "KOR",
                "KUR",
                "LAO",
                "LAT",
                "LAV",
                "LIT",
                "LTZ",
                "MAC",
                "MAL",
                "MLT",
                "MAO",
                "MON",
                "NEP",
                "NPI",
                "NOR",
                "ORI",
                "PER",
                "POL",
                "POR",
                "PUS",
                "RUM",
                "RUS",
                "SRP",
                "SND",
                "SIN",
                "SLO",
                "SLV",
                "SOM",
                "SPA",
                "SWA",
                "SWE",
                "SYR",
                "TGL",
                "TGK",
                "TAM",
                "TEL",
                "THA",
                "TIB",
                "TUR",
                "UKR",
                "URD",
                "UIG",
                "UZB",
                "VIE",
                "WEL",
                "UND"
            ]
        },
        "unicode_scripts": {
            "items": {
                "type": "string"
            },
            "type": "array"
        }
    },
    "required": [
        "language",
        "language_iso639_2b",
        "encoding"
    ],
    "type": "object"
}
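
A client could turn a response that follows this schema into a typed object, for example as sketched below. The class LanguageResult and the function parse_response are hypothetical names. The sketch checks only the three required properties and does not validate enum values.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Properties listed under "required" in the schema above.
REQUIRED_FIELDS = ("language", "language_iso639_2b", "encoding")


@dataclass
class LanguageResult:
    language: str
    language_iso639_2b: str
    encoding: str
    unicode_scripts: List[str] = field(default_factory=list)

    @property
    def identified(self):
        """False when the API reported "UND" (language not identified)."""
        return self.language_iso639_2b != "UND"


def parse_response(body):
    """Parse a JSON response body and check that required properties exist."""
    data = json.loads(body)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"response is missing required fields: {missing}")
    return LanguageResult(
        language=data["language"],
        language_iso639_2b=data["language_iso639_2b"],
        encoding=data["encoding"],
        unicode_scripts=list(data.get("unicode_scripts", [])),
    )


sample = (
    '{"language": "english", "language_iso639_2b": "ENG", '
    '"encoding": "UTF8", "unicode_scripts": ["Basic Latin"]}'
)
result = parse_response(sample)
print(result.language, result.identified)  # english True
```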
Examples
The following sample inputs can be used to try the API in several languages.
Identify English
New tests on human bones hidden in a Spanish cave for some 400,000 years set a new record for the oldest human DNA sequence ever decoded—and may scramble the scientific picture of our early relatives.
Identify German
Neue Versuche an menschlichen Knochen in einer spanischen Höhle versteckt für einige 400.000 Jahre einen neuen Rekord für den ältesten menschlichen DNA-Sequenz immer decodiert und kann den wissenschaftlichen Bild unserer frühen Verwandten klettern.
Identify Chinese
最新的大约40万年前的西班牙洞穴中发现的人类骨头测试,创造了已破译人类最古老DNA序列的新纪录-并且可能改变人类早期亲属的科学图谱。
Identify Japanese
約40年間スペインの洞窟の中に隠された人骨の新しいテストにて、これまでに解読された中で最も古い人間のDNA配列に関する新しい記録となりますーそしてこの発見がこれまでの人類早期の科学的な図を混乱させる可能性があります
To try this API with your own data and use it in your own applications, you need an API key. You can create an API key in the API Keys section of your account page.
