Advanced Haven OnDemand Unstructured Text Indexing
Create, customize and use Haven OnDemand text indexes

It is easy to create a text index in Haven OnDemand and index some data. If you want to create a test setup with some documents, you can be up and running in a few minutes (see Haven OnDemand Unstructured Text Indexing).

However, when you want to use Haven OnDemand for more serious applications, you might need some of the more advanced features of text indexes. The following sections describe how to plan and create your text index, add and manage your data, and maintain the index.

Note: Before you read this page, you might want to check the Text Indexes - Key Concepts page, and Introduction to Haven OnDemand Text Indexing.

Plan and Create an Index

When you create a text index, you must specify a flavor, and any custom fields that you want to create. You cannot modify these values after you create the index.

The following sections provide a guide to the main points you need to consider to allow you to create the right text index for your application.

What Do You Want to Search For?

Search has many applications, and the Haven OnDemand search APIs are very customizable, to allow you to do everything from a simple keyword search across your data set, to search filtering, suggestion, and advanced query manipulation.

Most of these functions are easy to set up, but some require a bit of planning. The most important considerations before you index are the document fields. In particular:

  • If you want to use the field_text parameter with the Query Text Index API, you must make sure that the fields you want to search for have the correct type. The Field Text Operators page lists the field types required for each operator. The Index Field Types page describes these field types.
  • If you want to use the Get Parametric Values API to provide faceted search, you must make sure that the fields you want to filter on are parametric or date type.

All index flavors have a set of standard fields, with configured field types. Often you can use these standard fields for your data. In other cases, you might need to create additional custom fields.

The number of custom fields and the field types that you need affect the flavor of text index that you choose.

Choose Your Fields

The following JSON is an example index document (there is more detail on this format below).

{
   "document" : [
      {
         "title" : "Exciting News from the UK",
         "reference" : "Events-UK-AliceWriter",
         "author" : "Alice Writer",
         "enriched_place" : "UK",
         "date" : "2016-04-01",
         "lat" : "52.2",
         "lon" : "0.12",
         "category" : "events",
         "event_id" : "1234567",
         "event_price" : "12.50",
         "content" : "In breaking news from Cambridge, UK, we find that some exciting things have been happening!",
         "event_details" : "Town centre, near the market square"
      }
   ]
}

This document contains a lot of different fields, which you might want to use in different ways. The following table describes some of these search types, and gives examples of which fields from this JSON document you might want to search in this way.

| Type of Search | Example | Possible Fields from Example Document | Required Field Type |
|---|---|---|---|
| Search for keyword, free text, Boolean or proximity expression to find documents that contain related concepts. | text=Cambridge town centre events | content, title, event_details | index |
| Search for a document that contains a specific date or a date within a range in a particular field. | field_text=RANGE{01/04/2016,01/05/2016}:date | date | date |
| Search for a numeric value (such as an ID), or a range of numeric values (such as a price range). | field_text=EQUAL{1234567}:event_id; field_text=NRANGE{10,15}:event_price | event_id, event_price | numeric |
| Filter a search by the value in a particular field. | getparametricvalues/v1?field_name=author | author, category | parametric |
| Search for documents with location information in or near a particular place. | field_text=DISTSPHERICAL{52,0.1,10}:lat:lon | lat, lon | numeric |
| Search for documents with an exact value in a particular field. | field_text=MATCH{Alice Writer}:author | author, category, enriched_place, reference | parametric, numeric, or reference |

Some of these examples use fields that are configured with a particular field type by default in the search index flavors. For example, content is already an index type field in the standard flavor, and author is parametric type. For others, you must create custom fields when you create the text index.

For details of the standard fields created by default with a new index, see Index Flavors and the index flavor pages.
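
The field_text expressions in the table above all follow the same OPERATOR{values}:field pattern, so you can build them with a small helper. The following Python sketch is illustrative only: the field_text function name is hypothetical, and it only formats the string and URL-encodes it; it does not call Haven OnDemand.

```python
from urllib.parse import urlencode, parse_qs


def field_text(operator, values, *fields):
    """Format OPERATOR{v1,v2,...}:field1:field2, the pattern used in the table."""
    joined_values = ",".join(str(value) for value in values)
    joined_fields = ":".join(fields)
    return f"{operator}{{{joined_values}}}:{joined_fields}"


date_filter = field_text("RANGE", ["01/04/2016", "01/05/2016"], "date")
price_filter = field_text("NRANGE", [10, 15], "event_price")
near_cambridge = field_text("DISTSPHERICAL", [52, 0.1, 10], "lat", "lon")

# URL-encode the expression so that braces, commas, and slashes survive transport.
query_string = urlencode(
    {"text": "Cambridge town centre events", "field_text": date_filter}
)

# Decoding the query string gives back the original expression.
decoded = parse_qs(query_string)["field_text"][0]
```

Building the expression separately from the query string keeps the operator syntax readable and leaves the encoding to the standard library.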

Choose a Flavor

There are several different flavors of Haven OnDemand text indexes available. For a full list, see Index Flavors.

The Categorization and Query Manipulation text indexes are configured with particular API sets in mind, and the documents that you index are slightly different.

For general queries, the main flavors are Standard, Explorer, Jumbo, and Custom_Fields. These are all based on a similar configuration set, with some differences:

  • Standard allows you to add a limited number of custom index fields and parametric fields, on top of the normal configuration (which includes several default fields). If you have looked at your data and decided you want to be able to filter on only a few custom fields (if any), Standard is probably suitable.
  • Explorer has a smaller capacity (and a correspondingly lower resource cost). Otherwise, it is identical to Standard. This flavor is intended for test setups. For most serious uses you probably want to use Standard or Custom_Fields.
  • Jumbo has a larger capacity (and a correspondingly higher resource cost). Otherwise, it is identical to Standard. This flavor is intended for very large systems.
  • Custom_Fields, as the name suggests, lets you set up more of your own fields with special types. As well as allowing you to configure more index and parametric type fields, you can also configure numeric fields, and expire date fields, which allow you to automatically expire old content on a particular date. If you want to be able to set more than five filters, or use Field Text Operators that require a numeric type, you should create a Custom_Fields index.

The following table shows the number of custom fields you can create with each different type for the main text index flavors.

| Flavor | Resource Cost | Index | Parametric | Numeric | Expire Date |
|---|---|---|---|---|---|
| Explorer | 1 | 5 | 5 | 0 | 0 |
| Standard | 10 | 5 | 5 | 0 | 0 |
| Jumbo | 20 | 5 | 5 | 0 | 0 |
| Custom_Fields | 12 | 15 | 10 | 10 | 3 |

For more information about these field types, see Index Field Types.
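
If you know how many custom fields of each type you need, you can check the limits in the table mechanically. The following Python sketch copies the numbers from the table; the FLAVOR_LIMITS and flavors_that_fit names are hypothetical. It considers only custom field counts and resource cost, not the document capacity of each flavor.

```python
# Custom field limits and resource cost per flavor, copied from the table.
FLAVOR_LIMITS = {
    "explorer": {"cost": 1, "index": 5, "parametric": 5, "numeric": 0, "expire_date": 0},
    "standard": {"cost": 10, "index": 5, "parametric": 5, "numeric": 0, "expire_date": 0},
    "jumbo": {"cost": 20, "index": 5, "parametric": 5, "numeric": 0, "expire_date": 0},
    "custom_fields": {"cost": 12, "index": 15, "parametric": 10, "numeric": 10, "expire_date": 3},
}


def flavors_that_fit(index=0, parametric=0, numeric=0, expire_date=0):
    """Return the flavors that allow the requested custom fields, cheapest first."""
    needed = {
        "index": index,
        "parametric": parametric,
        "numeric": numeric,
        "expire_date": expire_date,
    }
    fits = [
        name
        for name, limits in FLAVOR_LIMITS.items()
        if all(limits[field_type] >= count for field_type, count in needed.items())
    ]
    return sorted(fits, key=lambda name: FLAVOR_LIMITS[name]["cost"])


few_filters = flavors_that_fit(parametric=3)
needs_numeric = flavors_that_fit(index=2, numeric=2)
too_many = flavors_that_fit(index=20)
```

An empty result means that no single index of these flavors can hold all the custom fields, so you would need to reduce the number of fields or reuse standard fields.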

Create the Index

You can create a text index by using the Create Text Index wizard on the Text Indexes account page, or by using the Create Text Index API.

Note: The following procedure creates a text index without custom fields. For information about how to create a text index with custom fields, see Custom Fields.

For full documentation, see Create Text Index.

http://api.havenondemand.com/1/api/[sync|async]/createtextindex/v1?

This API has two required parameters:

  • index. The name of the new index.
  • flavor. The flavor of the index. The flavor defines the size and configuration of the index. For most uses, such as indexing normal documents for search, you typically use the Standard flavor. For testing, you can use the smaller Explorer flavor. For the full list of flavors, see Index Flavors.

Optionally, you can add:

  • description. A description for your index. The description is returned when you list your indexes, which makes it easier to tell your indexes apart.

For information about how to create a text index by using the Create Text Index wizard, see Manage Text Indexes.

Custom Fields

The Create Text Index wizard on the Text Indexes account page allows you to set custom fields for your text indexes. You can also create custom fields by using the Create Text Index API.

The following parameters are available to allow you to set custom fields, and the flavors where these parameters are available.

| Parameter | Field Type Description | Flavors |
|---|---|---|
| index_fields | Index fields contain document content, which receives linguistic processing for keyword and conceptual search. | Standard, Explorer, Custom_Fields, and Jumbo |
| parametric_fields | Parametric fields contain values that you want to use for search filtering and exact text matches. | Standard, Explorer, Custom_Fields, and Jumbo |
| expire_date_fields | Expire Date fields contain a date that you want to use to automatically expire the document. | Custom_Fields |
| numeric_fields | Numeric fields contain numeric values that you want to use for searches. | Custom_Fields |

You can set each of these parameters multiple times to create multiple fields with a particular type (that is, each parameter is an array type). Document field names are case insensitive (title is equivalent to TITLE and Title), and cannot contain spaces.

For example:

http://api.havenondemand.com/1/api/sync/createtextindex/v1?index=mycustomfieldsindex&flavor=custom_fields&index_fields=summary&index_fields=body&parametric_fields=language&numeric_fields=number_recipients

This example creates two index fields, summary and body, a parametric field language, and a numeric field number_recipients.
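
Because each custom field parameter is an array, the same parameter name repeats in the query string. The following Python sketch shows one way to build such a request URL with the standard library. The create_index_url function is hypothetical, it does not send the request, and it omits any authentication parameter (such as an API key) that your account might require.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.havenondemand.com/1/api/sync/createtextindex/v1"


def create_index_url(index, flavor, **custom_fields):
    """Build a Create Text Index URL; each keyword argument is a list of field names."""
    params = {"index": index, "flavor": flavor}
    seen = set()
    for parameter, names in custom_fields.items():
        for name in names:
            if " " in name:
                raise ValueError(f"field name cannot contain spaces: {name!r}")
            # Field names are case insensitive, so Title and title would clash.
            if name.lower() in seen:
                raise ValueError(f"duplicate field name: {name!r}")
            seen.add(name.lower())
        params[parameter] = list(names)
    # doseq=True repeats the parameter name once for each value in a list.
    return f"{BASE_URL}?{urlencode(params, doseq=True)}"


url = create_index_url(
    "mycustomfieldsindex",
    "custom_fields",
    index_fields=["summary", "body"],
    parametric_fields=["language"],
    numeric_fields=["number_recipients"],
)
```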

For details of the standard fields created by default with a new index, see Index Flavors and the index flavor pages.

Create Documents

Index documents can come from many different sources, and you have several options for how to get them into a Haven OnDemand text index.

When you use a Connector, the Connector automatically sends the documents to a configured text index. All the other methods for creating and uploading documents require you to call the Add to Text Index API. See Add Data to the Index.

The following sections provide a bit more detail on these options.

Upload a File

Haven OnDemand can automatically extract the text from many different file types (such as word processing documents, PDF, spreadsheets, and so on). It uses HPE KeyView to open the file and find the text content.

You can try this yourself by using the Text Extraction API.

The Add to Text Index API allows you to upload a file directly in binary format (file parameter). The API uses the Text Extraction API to get the text content, and then sends it to the text index.

Alternatively, you can upload files to Haven OnDemand by using the Store Object API. This API provides an object store reference, which you can submit to the Add to Text Index API (reference parameter). Haven OnDemand extracts the content from the object store (using the Text Extraction API to get the text from it, if required), and adds it to the text index.

If your document is already available on a publicly accessible URL, you can also submit a URL to the Add to Text Index API (url parameter). This option can be a link to a file, or a Web page that you want to index.

Convert Files to JSON

Whichever method you choose to send content to your Haven OnDemand text index, the index document is in JSON format when it reaches the text index. If you want control over the fields in your documents, you can create the JSON objects yourself.

One way to do this is to use the Text Extraction API on your documents. By default, the API extracts text and metadata from the file. You can add custom metadata to the document by using the additional_metadata parameter, which accepts a JSON object (or an array of JSON objects, if you upload multiple files). You can send this JSON output to the Add to Text Index API.

Tip: You can also use the additional_metadata parameter directly with the Add to Text Index API. See Add Data to the Index, below.

In some cases, you might want to create the JSON objects manually. There are some rules about the structure of the JSON objects that you use, to ensure that Haven OnDemand can read it and process it correctly. For example, the index documents must be in a document array, and the reference and content fields can occur only once per index document. For more information, see the Add to Text Index API documentation.

The JSON object has the following form:

{
   "document" : [
      {
         "title" : "This is my document",
         "reference" : "mydoc1",
         "myfield" : ["a value"],
         "content" : "A large block of text, which makes up the main body of the document."
      }, 
      {
         "title" : "My Other document",
         "reference" : "mydoc2",
         "content" : "This document is about something else."
      }
   ]
}
  • document is an array of objects, each of which is a document that you want to be able to return individually. You can add multiple documents in the same document array.
  • reference is a document reference, which you can use to identify the document. If you do not include a reference, Haven OnDemand automatically generates one.
  • content is the main part of the data you store. Use this field to store the bulk of the document that you want to be able to search by text matching. For example, typically you would use content to store the body of emails, the text of a book, or the main part of a Wikipedia page.
  • myfield is a custom field name, which allows you to customize your search further. For example, myfield might be some document metadata that you want to store so that it returns in your search results, but that you do not want to search for directly. You can also use some predefined field names, which have particular properties that allow you to search and filter by values in your fields. For more information about the document fields, see Index Field Types. For a list of the predefined field names, see the documentation for each flavor in Index Flavors.
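
If you generate the JSON yourself, a small amount of validation can catch structural mistakes before you send it. The following Python sketch builds the document array described above. The make_document and make_payload names are hypothetical, and the check covers only the rule that reference and content occur once per document (modelled here as a single string rather than a list).

```python
import json


def make_document(content, reference=None, **fields):
    """Build one index document; reference is optional and omitted if not given."""
    document = {}
    if reference is not None:
        document["reference"] = reference
    document["content"] = content
    document.update(fields)
    return document


def make_payload(*documents):
    """Wrap documents in the top-level document array and serialize to JSON."""
    for document in documents:
        for single_valued in ("reference", "content"):
            if isinstance(document.get(single_valued), (list, tuple)):
                raise ValueError(f"{single_valued} can occur only once per document")
    return json.dumps({"document": list(documents)})


payload = make_payload(
    make_document(
        "A large block of text, which makes up the main body of the document.",
        reference="mydoc1",
        title="This is my document",
        myfield=["a value"],
    ),
    make_document(
        "This document is about something else.",
        reference="mydoc2",
        title="My Other document",
    ),
)
```

The resulting string is the kind of value that you would submit as the JSON input when you add data to the index.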

Use a Connector

A Connector is a tool that connects to a repository and crawls it for content. It automatically extracts the documents from the repository, and sends them to the Haven OnDemand text index. In the same way that the files you upload go through the Text Extraction API, the connector also uses the Text Extraction API to get the text content.

You can schedule your connectors to automatically pick up changes in the repositories, and index new and updated documents.

Connectors are specific to a repository, not a file type. For example, the connector that indexes a file system is different from the one that indexes a Web site. All connector types can process and index all supported file types.

Add Data to the Index

After you have your documents, you can add them to the index.

When you want to add a file directly, you can use the Text Indexes Account page to drag and drop a file, and Haven OnDemand will extract the text, process it, and add it to your index. Otherwise, you can use the Add to Text Index API, with the appropriate parameters according to the input type you want to use.

For full documentation, see Add to Text Index.

http://api.havenondemand.com/1/api/[sync|async]/addtotextindex/v1?

The API has two required parameters:

  • index. The text index that you want to add the data to (that is, the one you created in the previous section).
  • The input, which can be one of the formats already discussed: file, json, reference, or url.
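
The following Python sketch shows one way to check these required parameters before you make a call. It assumes that you supply exactly one input per request, which is an interpretation of the list above rather than a documented rule, and the add_to_index_params name is hypothetical.

```python
INPUT_PARAMETERS = ("file", "json", "reference", "url")


def add_to_index_params(index, **inputs):
    """Return the request parameters, insisting on exactly one input type."""
    unknown = set(inputs) - set(INPUT_PARAMETERS)
    if unknown:
        raise ValueError(f"unknown input parameters: {sorted(unknown)}")
    supplied = [name for name in INPUT_PARAMETERS if inputs.get(name) is not None]
    if len(supplied) != 1:
        raise ValueError(
            f"expected exactly one of {INPUT_PARAMETERS}, got {supplied}"
        )
    chosen = supplied[0]
    return {"index": index, chosen: inputs[chosen]}


# The URL here is a placeholder, not a real page.
params = add_to_index_params("myfirstindex", url="https://www.example.com/page.html")
```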

The API also has some optional parameters, described in the following sections.

Additional Metadata

The additional_metadata parameter allows you to add extra metadata to your documents when you index them. This parameter is only available if you are submitting a file, URL, or object store reference. You can use it to make sure that the documents contain some additional information that does not exist in the original file, such as a category or language information.

http://api.havenondemand.com/1/api/[sync|async]/addtotextindex/v1?index=myfirstindex&file=HowTo_Search.html&additional_metadata={"details":"Haven OnDemand documentation"}

This example indexes the file HowTo_Search.html to myfirstindex, and adds the following JSON to the index document:

{
	"details" : "Haven OnDemand documentation"
}
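
The JSON value contains braces, quotation marks, and spaces, so it needs URL encoding when you send it in a query string. The following Python sketch serializes the metadata and encodes it with the standard library. It uses the url input type with a placeholder address, the metadata_query name is hypothetical, and it only builds the query string.

```python
import json
from urllib.parse import urlencode, parse_qs


def metadata_query(index, url, metadata):
    """Build a query string that carries additional_metadata as encoded JSON."""
    params = {
        "index": index,
        "url": url,
        "additional_metadata": json.dumps(metadata),
    }
    return urlencode(params)


metadata = {"details": "Haven OnDemand documentation"}
query = metadata_query(
    "myfirstindex", "https://www.example.com/HowTo_Index.html", metadata
)

# Decoding the query string recovers the original JSON object.
round_trip = json.loads(parse_qs(query)["additional_metadata"][0])
```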

After you index the file, you can check the format of the document in the index by using the Get Content API.

For example:

https://api.havenondemand.com/1/api/sync/getcontent/v1?index_reference=HowTo_Search.html&indexes=myfirstindex&print=all

This call returns the index document:

{
  "documents": [
    {
      "reference": "HowTo_Search.html",
      "section": 0,
      "index": "myfirstindex",
      "content_type": [
        "text/plain"
      ],
      "document_attributes": [
        "0"
      ],
      "file_size": [
        "22140"
      ],
      "details": [
        "Haven OnDemand documentation"
      ],
      ...

You can see the additional details field in the document.

Check for Duplicates

When you index a document, Haven OnDemand can automatically check your index for another document that has the same reference.

Haven OnDemand does not require the reference field in your documents to have a unique value (it uses a separate key to uniquely identify index documents). For example, if you use a document title as the reference, you can have cases where two or more documents from different places have the same title, and so they will have the same reference.

At index time, Haven OnDemand can detect cases where you have multiple documents with the same reference, and you can decide what you want to do with them, by using the duplicate_mode parameter.

This parameter has the following possible values:

  • duplicate. Allow duplicate references. Use this option if you might have different documents with the same reference value. In this case, Haven OnDemand keeps both versions of the file.
  • replace. Remove existing documents with the same reference. Use this option if you generally do not have documents with the same reference, and you want to index a new version of the document over the old one (for example, to index an updated version).

The default value is replace: when you upload a new document, any existing document with the same reference is overwritten.
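
To make the difference between the two modes concrete, the following Python sketch models a text index as a plain list of documents. It is an in-memory illustration of the behavior described above, not how Haven OnDemand stores documents, and the add_document name is hypothetical.

```python
def add_document(index_documents, document, duplicate_mode="replace"):
    """Return a new document list after adding a document in the given mode."""
    if duplicate_mode not in ("duplicate", "replace"):
        raise ValueError(f"unsupported duplicate_mode: {duplicate_mode!r}")
    if duplicate_mode == "replace":
        # Drop any existing document that shares the new document's reference.
        index_documents = [
            existing
            for existing in index_documents
            if existing["reference"] != document["reference"]
        ]
    return index_documents + [document]


version_1 = {"reference": "report", "content": "First draft."}
version_2 = {"reference": "report", "content": "Final version."}

replaced = add_document(add_document([], version_1), version_2)
kept_both = add_document(add_document([], version_1), version_2, "duplicate")
```

With replace, only the final version remains. With duplicate, both documents stay in the index under the same reference.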

Add a Common Prefix

When you add a file, reference or url to the text index, you can specify the reference_prefix parameter. This parameter specifies a prefix to use in the reference value that Haven OnDemand assigns to the file.

For example, you can specify a prefix to all files that you index from a common repository, to distinguish them from a file with the same name in a different repository. You can then use this value in combination with the duplicate_mode parameter to index new versions of the files.

http://api.havenondemand.com/1/api/[sync|async]/addtotextindex/v1?index=myfirstindex&file=GettingStarted.html&reference_prefix=webdocs

This call returns the following response:

{
  "index": "myfirstindex",
  "references": [
    {
      "reference": "webdocs_GettingStarted.html",
      "id": 7
    }
  ]
}

Delete Documents from an Index

You can delete documents from a text index by using the Delete from Text Index API.

For full documentation, see Delete from Text Index API.

http://api.havenondemand.com/1/api/[sync|async]/deletefromtextindex/v1?

This API has one required parameter, index, which specifies the text index that you want to delete the document from.

It has the following optional parameters:

  • index_reference. The reference of the document that you want to delete. You can specify an array of values to delete multiple documents in one call.
  • delete_all_documents. Set this to true to empty your text index.

There is no confirmation step for deleting documents. If you delete a document by mistake, you can reindex the file, or restore the index to an earlier state by using the Restore Text Index API. See Restore a Text Index.

Manage Your Text Indexes

The following section describes some of the APIs that you can use to manage your text indexes.

List your Text Indexes

To get the list of text indexes for your account, you can go to the Account page, and click Text Indexes. This page provides a list of your text indexes.

You can also use the List Resources API to find the names and descriptions of your text indexes programmatically. This API also returns the details for any connectors and query profiles that you have created. To restrict the results to just text indexes, set the type parameter to Content.

For example:

http://api.havenondemand.com/1/api/[sync|async]/listresources/v1?type=Content

The API response includes details for the public resources as well as your private resources. For the public resources, it returns only the resource name, description, and type. For your private resources, it also returns the flavor, the creation date, the display name, and the resource UUID.

{
   "public_resources": [
      {
         "resource": "wiki_ita",
         "description": "https://it.wikipedia.org/",
         "type": "content"
      },
      ...
   ],
   "private_resources": [
      {
         "resource": "myfirstindex",
         "type": "content",
         "flavor": "explorer",
         "description": "A description for the text index.",
         "date_created": "Fri Dec 18 2015 12:22:32 GMT+0000 (UTC)",
         "display_name": null,
         "resourceUUID": "abcdef12-3456-7890-abcd-ef1234567890"
      }
   ]
}
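
After you have the response, you usually want only your own text indexes. The following Python sketch parses a response with the structure shown above and returns the name and flavor of each private text index. The private_text_indexes name is hypothetical, and the sample response is a shortened copy of the example.

```python
import json


def private_text_indexes(response_text):
    """Map each private text index name to its flavor."""
    response = json.loads(response_text)
    return {
        resource["resource"]: resource.get("flavor")
        for resource in response.get("private_resources", [])
        if resource.get("type") == "content"
    }


sample_response = """
{
   "public_resources": [
      {"resource": "wiki_ita", "description": "https://it.wikipedia.org/", "type": "content"}
   ],
   "private_resources": [
      {"resource": "myfirstindex", "type": "content", "flavor": "explorer"}
   ]
}
"""

my_indexes = private_text_indexes(sample_response)
```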

Find the Status of an Index

In the Text Indexes section of the Account page, you can click a text index to view more details, such as the number of indexed documents and the size of the index.

You can also retrieve this information by using the Index Status API. This API accepts a single parameter, index, which you must set to the name of the index. For example:

http://api.havenondemand.com/1/api/[sync|async]/indexstatus/v1?index=myfirstindex

This call returns information for the index, myfirstindex:

{
   "total_documents": 0,
   "total_index_size": 128,
   "24hr_index_updates": 0,
   "component_count": 1
}

In this example, the index is empty.

Restore a Text Index

Haven OnDemand automatically backs up your text indexes on a regular schedule, and keeps a log of the Add to Text Index API and Delete from Text Index API requests that you send. If there is a problem with your index, or you want to return to an earlier state for any reason, you can restore from a backup by using the Restore Text Index API.

For full documentation, see Restore Text Index.

http://api.havenondemand.com/1/api/[sync|async]/restoretextindex/v1?

The Restore Text Index API does not overwrite the original index. It creates a new text index, and copies the configuration and index data from the backup of the original.

Note: If you want to restore an index, you must have sufficient static resource units to create a new index with the same flavor as the index you want to restore.

The Restore Text Index API has three required parameters:

  • index. The name of the index you want to restore.
  • date. The date and time you want to restore to (in any ISO 8601 date format).
  • new_index. The name of the index you want to create for the restored content.

For example:

http://api.havenondemand.com/1/api/[sync|async]/restoretextindex/v1?index=myfirstindex&new_index=myrestoredindex&date=2016-01-18T00:00:00

This call creates the index myrestoredindex to restore the state of myfirstindex at midnight on 18 January 2016. It returns the following success response:

{
   "restored": "myrestoredindex"
}
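
When you build the restore request in code, you can let the standard library produce the ISO 8601 date and encode it for the query string. The following Python sketch is illustrative only: the restore_query name is hypothetical, the check that the two index names differ is a precaution based on the description above, and the function does not send the request.

```python
from datetime import datetime
from urllib.parse import urlencode


def restore_query(index, new_index, restore_point):
    """Build the query string for a restore to the given datetime."""
    if index == new_index:
        raise ValueError("new_index must differ from the index you restore from")
    params = {
        "index": index,
        "new_index": new_index,
        "date": restore_point.isoformat(),
    }
    return urlencode(params)


query = restore_query("myfirstindex", "myrestoredindex", datetime(2016, 1, 18))
```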

Delete a Text Index

You can delete your text indexes by using the Text Indexes section of the Account page, or by using the Delete Text Index API.

For full documentation, see Delete Text Index API.

Deletion is permanent; you cannot restore a text index after you delete it. Both methods for deleting the text index ask for confirmation to make sure that you do not delete it accidentally.

On the Account page, you can click the text index that you want to delete, then click Delete Text Index. A dialog box asks you to confirm the deletion. When you click Delete again, it deletes the index.

For the Delete Text Index API, you must call the API in two stages. First, call it with only the index parameter:

https://api.havenondemand.com/1/api/sync/deletetextindex/v1?index=mysecondindex

The API response includes the confirm code:

{
   "deleted": false,
   "confirm": "1453293809:def0253a1f0fef41bf3fa9ac1e111e21"
}

You then take the confirm code from the output and use it in the confirm parameter for the API:

https://api.havenondemand.com/1/api/sync/deletetextindex/v1?index=mysecondindex&confirm=1453293809:def0253a1f0fef41bf3fa9ac1e111e21

The API then returns a confirmation that the text index was deleted successfully.

{
   "deleted": true,
   "index": "mysecondindex"
}
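
The two-stage call can be wrapped in a single function. In the following Python sketch, call_api is a hypothetical callable that stands in for whatever HTTP client you use: it takes the request parameters and returns the parsed JSON response. The fake_api function imitates the responses shown above so that the example runs without a network connection.

```python
def delete_text_index(call_api, index):
    """Run both stages of the delete and return the final response."""
    first = call_api({"index": index})
    if first.get("deleted"):
        return first
    # Stage two: send the confirm code from the first response back to the API.
    return call_api({"index": index, "confirm": first["confirm"]})


def fake_api(params):
    """Stand-in for the real API, based on the example responses."""
    if "confirm" not in params:
        return {
            "deleted": False,
            "confirm": "1453293809:def0253a1f0fef41bf3fa9ac1e111e21",
        }
    return {"deleted": True, "index": params["index"]}


result = delete_text_index(fake_api, "mysecondindex")
```

Because deletion is permanent, you might prefer to keep the two stages separate in interactive tools, so that a person reviews the index name before the second call.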
