Text Extraction

Extracts text from a file.

The Text Extraction API uses HPE KeyView to extract metadata and text content from a file that you provide. The API can handle over 500 different file formats (for more information, see Supported Formats).

Quick Start

You must provide an input file, which you can specify as a file, a URL, or an object reference. To create an object reference, store a file by using the Store Object API; you can then pass the returned reference to this API. The following example submits a file:

curl -X POST http://api.havenondemand.com/1/api/[sync|async]/extracttext/v1 --form "file=@mydoc.doc"

Note: API input is subject to a maximum size quota. If you upload text or a file that is too large, the API returns an error. For more information, see Rate Limiting, Quotas, Data Expiry, and Maximums.

By default, the Text Extraction API extracts all the metadata it can from the file provided, as well as the text from the main content of the file. The response has the following form:

{
  "document": [
    {
      "reference": "root",
      "doc_iod_reference": "dd4a9c274eda24f9bac59c065cf979e4",
      "app_name": [
        "Microsoft Office Word"
      ],
      "author": [
        "janesmith"
      ],
... other metadata fields ...
      "content": "This is the content of my document..."
    }
  ]
}
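
As a rough illustration of how a client might read this response, the sketch below separates the extracted text from the metadata fields. The helper name split_document and the choice of which keys count as bookkeeping are assumptions made for this example; the sample simply repeats the response above without the elided metadata fields.

```python
import json

# Sample response copied from the example above, with the elided
# "other metadata fields" left out.
SAMPLE_RESPONSE = """
{
  "document": [
    {
      "reference": "root",
      "doc_iod_reference": "dd4a9c274eda24f9bac59c065cf979e4",
      "app_name": ["Microsoft Office Word"],
      "author": ["janesmith"],
      "content": "This is the content of my document..."
    }
  ]
}
"""

# Keys treated as bookkeeping rather than metadata. This split is a choice
# made for this sketch, not something the API defines.
NON_METADATA_KEYS = {"reference", "doc_iod_reference", "content"}


def split_document(doc):
    """Return (content, metadata) for one entry of the "document" array."""
    content = doc.get("content", "")  # absent when text extraction is disabled
    metadata = {k: v for k, v in doc.items() if k not in NON_METADATA_KEYS}
    return content, metadata


response = json.loads(SAMPLE_RESPONSE)
for doc in response["document"]:
    content, metadata = split_document(doc)
    print(doc["reference"], sorted(metadata), content)
```

Note that the metadata values in the sample (app_name, author) are arrays, so a client should expect lists rather than single strings.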

If the file is password-protected, you can send the password in the password parameter. For example:

curl -X POST http://api.havenondemand.com/1/api/[sync|async]/extracttext/v1 --form "file=@mydoc.doc" --form "password=myfilepassword"

You can also specify URLs as input. In this case, Haven OnDemand retrieves the file and extracts the text. For example:

/1/api/[sync|async]/extracttext/v1?url=http://mysite.com/mydoc.doc
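
When building this request in code, the file URL must be percent-encoded before it is placed in the query string. The following is a minimal sketch using only the Python standard library; the function name build_extract_url and the placeholder key YOUR_API_KEY are hypothetical, and the apikey parameter is the one described under Authentication.

```python
from urllib.parse import urlencode

BASE = "https://api.havenondemand.com/1/api"


def build_extract_url(file_url, api_key, mode="sync"):
    """Build a Text Extraction request URL for a file hosted at file_url."""
    if mode not in ("sync", "async"):
        raise ValueError("mode must be 'sync' or 'async'")
    # urlencode percent-encodes the nested URL so it survives as one value.
    query = urlencode({"url": file_url, "apikey": api_key})
    return f"{BASE}/{mode}/extracttext/v1?{query}"


print(build_extract_url("http://mysite.com/mydoc.doc", "YOUR_API_KEY"))
```

The resulting string can be passed to any HTTP client; nothing is sent over the network by this code.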

You can disable metadata extraction and text extraction individually. To return only the metadata, set extract_text to false:

curl -X POST http://api.havenondemand.com/1/api/[sync|async]/extracttext/v1 --form "file=@mydoc.doc" --form "extract_text=false"

To return only the text, set extract_metadata to false:

curl -X POST http://api.havenondemand.com/1/api/[sync|async]/extracttext/v1 --form "file=@mydoc.doc" --form "extract_metadata=false"
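
The same multipart form upload can be assembled without curl. The sketch below is one possible way to do it with the Python standard library, using the parameter names listed in the Parameters section (extract_text, extract_metadata, apikey). The helper names, the application/octet-stream content type, and the placeholder key are assumptions for illustration; the request is only constructed here, not sent.

```python
import uuid
import urllib.request

ENDPOINT = "https://api.havenondemand.com/1/api/sync/extracttext/v1"


def extraction_fields(api_key, text=True, metadata=True):
    """Form fields that switch text and metadata extraction on or off."""
    return {
        "apikey": api_key,
        "extract_text": "true" if text else "false",
        "extract_metadata": "true" if metadata else "false",
    }


def encode_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The file goes in the "file" field, as in the curl examples.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(api_key, file_name, file_bytes, text=True, metadata=True):
    """Build (but do not send) a POST request that uploads one file."""
    body, content_type = encode_multipart(
        extraction_fields(api_key, text, metadata), file_name, file_bytes
    )
    return urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": content_type}, method="POST"
    )


# Metadata only: text extraction is switched off.
request = build_request("YOUR_API_KEY", "mydoc.doc", b"file contents", text=False)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would submit it; that step is left out so the example has no network side effects.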

Other APIs use the Text Extraction API to extract content from files for further analysis. You can also use the Text Extraction API to extract the contents of files that you have retrieved from a container file by using the Expand Container API. For example, the following request submits a file directly to another API (detectsentiment):

curl -X POST http://api.havenondemand.com/1/api/[sync|async]/detectsentiment/v1 --form "file=@mydoc.doc"

Synchronous: https://api.havenondemand.com/1/api/sync/extracttext/v1
Asynchronous: https://api.havenondemand.com/1/api/async/extracttext/v1

Authentication

This API requires an authentication token to be supplied in the following parameter:

apikey
    The API key to use to authenticate the API request.

Parameters

This API accepts the following parameters:

Required (specify the input by using one of the following)

file (array<binary>)
    The file that you want to extract text from.
reference (array<string>)
    A Haven OnDemand reference obtained from either the Expand Container or Store Object API. The corresponding document is passed to the API.
url (string)
    A publicly accessible HTTP URL to the file to extract text from.

Optional

additional_metadata (array<json>)
    A JSON object containing additional metadata to add to the extracted documents. This option does not apply to JSON input. To add metadata for multiple files, specify objects in order, separated by an empty object.
extract_metadata (boolean)
    Whether to extract metadata from the file. Default value: true.
extract_text (boolean)
    Whether to extract text from the file. Default value: true.
extract_xmlattributes (boolean)
    Whether to extract XML attributes from the file. If your content is in XML, set this parameter to true to extract the contents of XML tag attributes. For example, for the tag <xml element="attributeValue">My example text</xml>, it extracts the text 'attributeValue My example text'. Set this parameter to false to extract only the tag contents (in this case, just 'My example text'). Note: the Text Extraction API never extracts the names of the XML tags and attributes. Default value: false.
password (array<string>)
    Passwords to use to extract the files.
reference_prefix (array<string>)
    A string to add to the start of the reference of documents that are extracted from a file. This option does not apply to JSON input. To add a prefix for multiple files, specify prefixes in order, separated by a space.
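
To make the extract_xmlattributes description concrete, the sketch below reproduces the documented behavior locally with Python's xml.etree.ElementTree. It only illustrates the expected output for the example tag; it is not how the API or KeyView is implemented, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def _collect(element, include_attributes, pieces):
    """Walk the tree in document order, gathering attribute values and text."""
    if include_attributes:
        pieces.extend(element.attrib.values())  # values only, never names
    if element.text and element.text.strip():
        pieces.append(element.text.strip())
    for child in element:
        _collect(child, include_attributes, pieces)
        # Text that follows a child element belongs to the parent's content.
        if child.tail and child.tail.strip():
            pieces.append(child.tail.strip())


def extracted_text(xml_source, include_attributes=False):
    """Approximate the text returned for an XML input."""
    pieces = []
    _collect(ET.fromstring(xml_source), include_attributes, pieces)
    return " ".join(pieces)


SAMPLE = '<xml element="attributeValue">My example text</xml>'
print(extracted_text(SAMPLE, include_attributes=True))   # attributeValue My example text
print(extracted_text(SAMPLE, include_attributes=False))  # My example text
```

In both cases the tag name (xml) and the attribute name (element) are left out, matching the note in the parameter description.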

This API returns a JSON response that is described by the model below. The model is presented both as an easy-to-read abstract definition and as a formal JSON schema.

Asynchronous Use

Additional requests are required to get the result if this API is invoked asynchronously.

You can use /1/job/status/<job-id> to get the status of the job, including results if the job is finished.

You can also use /1/job/result/<job-id>, which waits until the job has finished and then returns the result.
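
The polling pattern described above might be sketched as follows. Several details are assumptions that this page does not confirm: that the job endpoints accept the same apikey parameter, that the status reply carries a "status" field, and that "queued" and "in progress" are the values meaning the job is still running. The HTTP call is injected as fetch_json so the sketch stays independent of any particular client.

```python
import time
from urllib.parse import urlencode

HOST = "https://api.havenondemand.com"

# ASSUMPTION: status values that mean "not finished yet".
PENDING_STATES = {"queued", "in progress"}


def job_url(kind, job_id, api_key):
    """Build /1/job/status/<job-id> or /1/job/result/<job-id>."""
    if kind not in ("status", "result"):
        raise ValueError("kind must be 'status' or 'result'")
    # ASSUMPTION: job endpoints take the same apikey parameter as the API.
    query = urlencode({"apikey": api_key})
    return f"{HOST}/1/job/{kind}/{job_id}?{query}"


def wait_for_job(job_id, api_key, fetch_json, interval=2.0, max_polls=30):
    """Poll the status endpoint until the job is no longer pending.

    fetch_json is any callable that takes a URL and returns the parsed
    JSON reply as a dict.
    """
    url = job_url("status", job_id, api_key)
    for _ in range(max_polls):
        reply = fetch_json(url)
        if reply.get("status") not in PENDING_STATES:
            return reply
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")


# Demonstration with a stub in place of a real HTTP client.
_replies = iter([{"status": "queued"}, {"status": "finished"}])
print(wait_for_job("example-job-id", "YOUR_API_KEY", lambda url: next(_replies), interval=0))
```

If blocking is acceptable, a single request to the URL from job_url("result", ...) avoids the loop entirely, since that endpoint waits until the job has finished.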

Model
This is an abstract definition of the response that describes each of the properties that might be returned.
Text Extraction Response {
document ( array[object] )
}
Model Schema
This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "properties": {
        "document": {
            "items": {
                "anyOf": [
                    {
                        "properties": {
                            "content": {
                                "type": "string"
                            },
                            "content_type": {
                                "items": {
                                    "type": "string"
                                },
                                "type": "array"
                            },
                            "doc_iod_reference": {
                                "type": "string"
                            },
                            "document_attributes": {
                                "items": {
                                    "type": "string"
                                },
                                "type": "array"
                            },
                            "import_original_encoding": {
                                "items": {
                                    "type": "string"
                                },
                                "type": "array"
                            },
                            "keyview_class": {
                                "items": {
                                    "type": "integer"
                                },
                                "type": "array"
                            },
                            "keyview_type": {
                                "items": {
                                    "type": "integer"
                                },
                                "type": "array"
                            },
                            "name": {
                                "type": "string"
                            },
                            "original_size": {
                                "items": {
                                    "type": "integer"
                                },
                                "type": "array"
                            },
                            "parent_iod_reference": {
                                "type": "string"
                            },
                            "reference": {
                                "type": "string"
                            }
                        },
                        "required": [
                            "reference",
                            "doc_iod_reference"
                        ],
                        "type": "object"
                    },
                    {
                        "properties": {
                            "doc_iod_reference": {
                                "type": "string"
                            },
                            "error": {
                                "properties": {
                                    "error": {
                                        "type": "integer"
                                    },
                                    "reason": {
                                        "type": "string"
                                    }
                                },
                                "required": [
                                    "error",
                                    "reason"
                                ],
                                "type": "object"
                            },
                            "parent_iod_reference": {
                                "type": "string"
                            },
                            "reference": {
                                "type": "string"
                            }
                        },
                        "required": [
                            "reference",
                            "doc_iod_reference",
                            "error"
                        ],
                        "type": "object"
                    }
                ],
                "type": "object"
            },
            "type": "array"
        }
    },
    "required": [
        "document"
    ],
    "type": "object"
}
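
The schema allows each entry of the document array to take one of two shapes: an extracted document, or an entry that carries an error object with an integer error code and a reason string. A client could separate the two as in the sketch below. The function name and all sample values (references, error code, reason text) are hypothetical and chosen only to match the schema's types.

```python
def partition_documents(response):
    """Split the "document" array into extracted documents and failures.

    Returns (extracted, failed), where failed is a list of
    (reference, error_code, reason) tuples.
    """
    extracted, failed = [], []
    for doc in response["document"]:
        if "error" in doc:
            # Second anyOf branch: "error" is required, with "error" and "reason".
            err = doc["error"]
            failed.append((doc["reference"], err["error"], err["reason"]))
        else:
            # First anyOf branch: only "reference" and "doc_iod_reference"
            # are required, so "content" may be missing.
            extracted.append(doc)
    return extracted, failed


# Hypothetical response containing one entry of each shape.
sample = {
    "document": [
        {"reference": "root", "doc_iod_reference": "id-1", "content": "Some text"},
        {
            "reference": "root:1",
            "doc_iod_reference": "id-2",
            "error": {"error": 1, "reason": "hypothetical failure reason"},
        },
    ]
}

extracted, failed = partition_documents(sample)
print(len(extracted), "extracted;", len(failed), "failed")
```

Checking for the error key first matters because both shapes share reference and doc_iod_reference, so those fields alone cannot distinguish success from failure.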

To use this API with your own data and in your own applications, you need an API key. You can create an API key from the API Keys section of your account page.
