Text Tokenization

Returns information about the terms in the specified text.

The Text Tokenization API helps you find more information about the terms that you might want to use in the Query Text Index API or other APIs. You provide a list of terms, or some text, and the API returns information about the terms in the text.

Quick Start

Provide the terms that you want information about in the text parameter. You can also return information about the terms in a document by providing the document in the file, reference, or url parameter.

By default, Haven OnDemand compares the terms in your document to the terms in the English Wikipedia public text index. You can optionally specify another text index, by setting the indexes parameter. In this case, Haven OnDemand provides information on your terms based on the index or indexes that you specify.

The results of the Text Tokenization API include:

  • the weight that the term holds in the specified text indexes, based on Advanced Probabilistic Concept Modelling (APCM).

  • the number of documents that the term occurs in, in the specified text indexes.

  • the total number of times that the term occurs, in the specified text indexes.

For example:

/1/api/[async|sync]/tokenizetext/v1?text=probability+theory

The term included in the response is the term that Haven OnDemand uses after processing. The processing includes stemming (reducing plurals and verb forms of a word to the same stem) and transliteration (converting accented characters to non-accented forms).

{
  "terms": [
    {
      "term": "PROBAB",
      "weight": 77,
      "documents": 49610,
      "occurrences": 191725,
      "case": 1,
      "length": 6
    },
    {
      "term": "THEOR",
      "weight": 52,
      "documents": 226510,
      "occurrences": 1502845,
      "case": 1,
      "length": 5
    }
  ]
}
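
As a rough illustration of reading this response, the following Python sketch parses the sample JSON shown above and lists the terms by weight, together with the average number of occurrences per document. The function name summarize_terms is hypothetical and not part of the API; the data is copied from the example response.

```python
import json

# Sample response copied from the example above.
SAMPLE_RESPONSE = """
{
  "terms": [
    {"term": "PROBAB", "weight": 77, "documents": 49610,
     "occurrences": 191725, "case": 1, "length": 6},
    {"term": "THEOR", "weight": 52, "documents": 226510,
     "occurrences": 1502845, "case": 1, "length": 5}
  ]
}
"""


def summarize_terms(response_text):
    """Return (term, weight, occurrences per document) rows, heaviest first."""
    terms = json.loads(response_text)["terms"]
    rows = []
    for t in terms:
        docs = t["documents"]
        # Guard against division by zero for terms absent from the index.
        per_doc = t["occurrences"] / docs if docs else 0.0
        rows.append((t["term"], t["weight"], round(per_doc, 2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


summary = summarize_terms(SAMPLE_RESPONSE)
for term, weight, per_doc in summary:
    print(term, weight, per_doc)
```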

When you specify multiple text indexes, the API returns each term once, with information from each text index reflected in the occurrence counts and weight information. If a term that you specify does not occur in the specified text indexes, the API returns the term with a default weight (the occurrence counts are zero).

If a term that you specify is a stop word (a very common term that does not provide much meaning, such as "the", "a", and "of" in English), the API returns the term, but the occurrence counts and weights are all zero, because Haven OnDemand does not store these terms. For more information about stop words, see Stop Lists in Text Indexes.
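
The two cases described above (a term that is missing from the index, and a stop word) could be told apart along these lines. This is a minimal sketch: classify_term is a hypothetical helper, the first two sample entries are invented for illustration, and the weight of 50 used for the missing term is a made-up stand-in because the actual default weight is not stated here. The check for a stop_word property relies on the response model, which says that property appears only for stop words.

```python
def classify_term(term):
    """Classify one entry of the "terms" array from a tokenization response."""
    no_counts = term["documents"] == 0 and term["occurrences"] == 0
    # Stop words come back with zero counts and zero weight; the model also
    # describes an optional stop_word property that marks them.
    if "stop_word" in term or (no_counts and term["weight"] == 0):
        return "stop word"
    # Terms that are not in the index keep a default (non-zero) weight.
    if no_counts:
        return "not in index"
    return "indexed"


# The first two entries are invented for illustration only; the third is
# taken from the example response earlier in this section.
samples = [
    {"term": "THE", "weight": 0, "documents": 0, "occurrences": 0},
    {"term": "ZYXQ", "weight": 50, "documents": 0, "occurrences": 0},
    {"term": "PROBAB", "weight": 77, "documents": 49610, "occurrences": 191725},
]

labels = [classify_term(t) for t in samples]
print(labels)
```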

You can use the Text Tokenization API to find information about the terms that Haven OnDemand matches in the Query Text Index API and other related APIs, such as Find Related Concepts and Find Similar. You can provide your complete query text in the text parameter, and the API ignores query syntax, such as Boolean and proximity operators, and provides information about the query terms. You can use this approach to find out the exact terms that Haven OnDemand matches in the query.

For example:

/1/api/[async|sync]/tokenizetext/v1?text=lunar+NEAR+crater&indexes=news_eng

{
  "terms": [
    {
      "term": "LUN",
      "weight": 106,
      "documents": 1310,
      "occurrences": 2440,
      "case": 1,
      "length": 3
    },
    {
      "term": "CRAT",
      "weight": 132,
      "documents": 369,
      "occurrences": 617,
      "case": 1,
      "length": 4
    }
  ]
}

If you want to tokenize the query operators as normal text, you can set the ignore_operators parameter to true. For example:

/1/api/[async|sync]/tokenizetext/v1?text=lunar+NEAR+crater&indexes=news_eng&ignore_operators=true

{
  "terms": [
    {
      "term": "LUN",
      "weight": 106,
      "documents": 1310,
      "occurrences": 2440,
      "case": 1,
      "length": 3,
      "start_pos": 1
    },
    {
      "term": "NEAR",
      "weight": 45,
      "documents": 25815,
      "occurrences": 31791,
      "case": 0,
      "length": 4,
      "start_pos": 7
    },
    {
      "term": "CRAT",
      "weight": 132,
      "documents": 369,
      "occurrences": 617,
      "case": 1,
      "length": 4,
      "start_pos": 12
    }
  ]
}
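
Comparing the two responses above shows which words were treated as query operators, and the start_pos values can be used to map each stemmed term back to the word in the query. The sketch below assumes that start_pos is a 1-based byte offset (which is what the values 1, 7, and 12 in the example suggest), that the text is UTF-8 encoded, and that words are separated by single spaces. The helper names are hypothetical.

```python
QUERY = "lunar NEAR crater"

# (term, start_pos) pairs copied from the ignore_operators=true response above.
PLAIN_TERMS = [("LUN", 1), ("NEAR", 7), ("CRAT", 12)]
# Terms from the default response above, where NEAR was read as an operator.
DEFAULT_TERMS = ["LUN", "CRAT"]


def original_word(text, start_pos):
    """Return the space-delimited word that begins at a 1-based byte offset."""
    data = text.encode("utf-8")  # assumption: offsets count UTF-8 bytes
    begin = start_pos - 1        # assumption: start_pos is 1-based
    end = data.find(b" ", begin)
    if end == -1:
        end = len(data)
    return data[begin:end].decode("utf-8")


def skipped_as_operators(default_terms, plain_terms):
    """List terms that only appear when operators are tokenized as text."""
    return [term for term, _ in plain_terms if term not in default_terms]


words = {term: original_word(QUERY, pos) for term, pos in PLAIN_TERMS}
operators = skipped_as_operators(DEFAULT_TERMS, PLAIN_TERMS)
print(words)
print(operators)
```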

For more information about text tokenization in Haven OnDemand, see Text Tokenization and Processing in Text Indexes.

Synchronous
https://api.havenondemand.com/1/api/sync/tokenizetext/v1

Asynchronous
https://api.havenondemand.com/1/api/async/tokenizetext/v1

Authentication

This API requires an authentication token to be supplied in the following parameter:

Parameter  Description
apikey     The API key to use to authenticate the API request.

Parameters

This API accepts the following parameters:

Required (provide one of the following input sources)
Name       Type    Description
file       binary  A file containing the document to process. Multipart POST only.
reference  string  A Haven OnDemand reference obtained from either the Expand Container or Store Object API. The corresponding document is passed to the API.
text       string  The text content to process.
url        string  A publicly accessible HTTP URL from which the document can be retrieved.

Optional
Name              Type             Description
indexes           array<resource>  The name of the Haven OnDemand text index that you want to search for results. You can use the public datasets, or your own text indexes. See Public Text Indexes. Default value: [wiki_eng].
max_terms         number           The maximum number of terms to return results for. Default value: 50.
stemming          boolean          Set this parameter to false to return the list of terms in their unstemmed form. Default value: true.
ignore_operators  boolean          Set this parameter to true to tokenize query operators within the input text as plain text. Default value: false.
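
As an illustration of how these parameters fit together, the following Python sketch assembles a synchronous GET URL for the text input. The function build_tokenize_url is hypothetical, YOUR_API_KEY is a placeholder, and repeating the indexes parameter once per index is an assumption about how array values are passed. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.havenondemand.com/1/api/sync/tokenizetext/v1"


def build_tokenize_url(text, apikey, indexes=None, max_terms=None,
                       stemming=None, ignore_operators=None):
    """Build a GET URL for the text input; optional values are sent only if set."""
    params = [("text", text)]
    # Assumption: an array parameter is passed by repeating its name.
    for name in indexes or []:
        params.append(("indexes", name))
    if max_terms is not None:
        params.append(("max_terms", str(max_terms)))
    if stemming is not None:
        params.append(("stemming", "true" if stemming else "false"))
    if ignore_operators is not None:
        params.append(("ignore_operators", "true" if ignore_operators else "false"))
    params.append(("apikey", apikey))
    # urlencode escapes spaces as "+", matching the examples in this section.
    return BASE_URL + "?" + urlencode(params)


url = build_tokenize_url("lunar NEAR crater", "YOUR_API_KEY",
                         indexes=["news_eng"], ignore_operators=True)
print(url)
```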

This API returns a JSON response that is described by the model below. This single model is presented both as an easy-to-read abstract definition and as the formal JSON schema.

Asynchronous Use

Additional requests are required to get the result if this API is invoked asynchronously.

You can use /1/job/status/<job-id> to get the status of the job, including results if the job is finished.

You can also use /1/job/result/<job-id>, which waits until the job has finished and then returns the result.
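
The two job endpoints above could be addressed as in the sketch below. It assumes that the job endpoints live on the same host as the API endpoints and that they accept the same apikey parameter; how the job ID is obtained from the asynchronous response is not covered in this section, so the sketch takes it as an argument. The function job_urls and the job ID shown are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumption: job endpoints share the host used by the API endpoints.
API_HOST = "https://api.havenondemand.com"


def job_urls(job_id, apikey):
    """Return the status and result URLs for an asynchronous job."""
    safe_id = quote(job_id, safe="")
    # Assumption: job requests are authenticated with the same apikey parameter.
    query = urlencode({"apikey": apikey})
    return {
        # Reports the job status, including results once the job has finished.
        "status": f"{API_HOST}/1/job/status/{safe_id}?{query}",
        # Waits until the job has finished, then returns the result.
        "result": f"{API_HOST}/1/job/result/{safe_id}?{query}",
    }


urls = job_urls("example-job-id", "YOUR_API_KEY")
print(urls["status"])
print(urls["result"])
```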

Model
This is an abstract definition of the response that describes each of the properties that might be returned.
Text Tokenization Response {
terms ( array[Terms] ) The details of the tokenized terms.
}
Text Tokenization Response:Terms {
case ( enum<Case> , optional) A value between 0 and 3 to indicate the case of the term.
documents ( number ) The number of documents that the specified term occurs in.
length ( number , optional) The length of the term in characters.
numeric ( enum<Numeric> , optional) A value indicating whether the term contains numbers.
occurrences ( number ) The number of times the specified term occurs.
start_pos ( number , optional) The position in bytes of the start of the word in the original text.
stop_word ( string , optional) This tag appears only if the term is a stop word.
term ( string ) The term whose information is displayed.
weight ( number ) The weight of the specified term.
}
enum<Text Tokenization Response:Terms:Case> {
0 The input text was all in upper case.
1 The input text was all in lower case.
2 The input text appeared with initial capitals.
3 The input text appeared in another type of mixed case.
}
enum<Text Tokenization Response:Terms:Numeric> {
0 The input text contained only alphabetical characters.
1 The input text contained only numeric characters.
2 The input text contained mixed alphabetical and numeric characters.
}
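
The case and numeric values above can be turned into readable labels with a simple lookup, as in this minimal Python sketch. The label strings and the describe_term helper are hypothetical; both properties are optional in the model, so the sketch tolerates their absence.

```python
# Labels paraphrase the enum descriptions above.
CASE_LABELS = {
    0: "upper case",
    1: "lower case",
    2: "initial capitals",
    3: "other mixed case",
}
NUMERIC_LABELS = {
    0: "alphabetical only",
    1: "numeric only",
    2: "mixed alphabetical and numeric",
}


def describe_term(term):
    """Return readable labels for the optional case and numeric properties."""
    return {
        "term": term["term"],
        "case": CASE_LABELS.get(term.get("case"), "not reported"),
        "numeric": NUMERIC_LABELS.get(term.get("numeric"), "not reported"),
    }


# Entry taken from the ignore_operators=true example, where NEAR was upper case.
described = describe_term({"term": "NEAR", "case": 0, "length": 4})
print(described)
```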
Model Schema
This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "properties": {
        "terms": {
            "items": {
                "properties": {
                    "case": {
                        "enum": [
                            0,
                            1,
                            2,
                            3
                        ]
                    },
                    "documents": {
                        "type": "number"
                    },
                    "length": {
                        "type": "number"
                    },
                    "numeric": {
                        "enum": [
                            0,
                            1,
                            2
                        ]
                    },
                    "occurrences": {
                        "type": "number"
                    },
                    "start_pos": {
                        "type": "number"
                    },
                    "stop_word": {
                        "type": "string"
                    },
                    "term": {
                        "type": "string"
                    },
                    "weight": {
                        "type": "number"
                    }
                },
                "required": [
                    "term",
                    "weight",
                    "documents",
                    "occurrences"
                ],
                "type": "object"
            },
            "type": "array"
        }
    },
    "required": [
        "terms"
    ],
    "type": "object"
}
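
For a quick sanity check without a JSON Schema library, the required properties and enum ranges in the schema above can be verified by hand. The sketch below is a partial check using only the Python standard library, not a full schema validator; validation_errors is a hypothetical helper.

```python
REQUIRED_NUMBERS = ("weight", "documents", "occurrences")


def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validation_errors(response):
    """Return a list of problems found; an empty list means the checks passed."""
    if not isinstance(response, dict):
        return ["response is not an object"]
    terms = response.get("terms")
    if not isinstance(terms, list):
        return ["'terms' is missing or not an array"]
    errors = []
    for i, term in enumerate(terms):
        if not isinstance(term, dict):
            errors.append(f"terms[{i}] is not an object")
            continue
        if not isinstance(term.get("term"), str):
            errors.append(f"terms[{i}].term is missing or not a string")
        for key in REQUIRED_NUMBERS:
            if not is_number(term.get(key)):
                errors.append(f"terms[{i}].{key} is missing or not a number")
        # Optional enum properties are checked only when present.
        if "case" in term and term["case"] not in (0, 1, 2, 3):
            errors.append(f"terms[{i}].case is outside 0-3")
        if "numeric" in term and term["numeric"] not in (0, 1, 2):
            errors.append(f"terms[{i}].numeric is outside 0-2")
    return errors


good = {"terms": [{"term": "LUN", "weight": 106, "documents": 1310,
                   "occurrences": 2440, "case": 1, "length": 3}]}
print(validation_errors(good))
```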

To try this API with your own data and use it in your own applications, you need an API Key. You can create an API Key from your account page - API Keys.
