Text Indexes Key Concepts
Haven OnDemand unstructured text indexing key concepts

The Haven OnDemand Search APIs operate on the content of text indexes.

What is a Text Index?

The text index is the store of document data, internally organized to make it easy for Haven OnDemand to find information for you.

The text index stores only the extracted text and metadata enrichments from your files, allowing you to search, organize, and filter the content. It is not a repository for document originals; if you want to read the whole document (with any embedded images, and so on), you can refer back to the original repository.

Haven OnDemand provides several standard text indexes, such as Wikipedia in several languages, and News websites. You can also create your own text indexes, to store any data that you want to search with Haven OnDemand APIs.

Documents

Haven OnDemand stores content in index documents. An index document is the searchable unit: when you search for something, each result (or hit) is an index document.

The index document can come from a variety of different sources, and might be any length: it could be an email, a Wikipedia page, a product entry from an online catalog, a PDF, or a single tweet.

Regardless of its original format, Haven OnDemand extracts the text into a JSON object. Each attribute in the JSON object is a field in the text index.
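
As a rough illustration of this structure, the following Python sketch builds a JSON object for a single index document. Only the reference and content fields are named in this guide; the other field names and all of the values are hypothetical, chosen for the example.

```python
import json

# Hypothetical index document. "reference" and "content" are the fields
# described in this guide; "title", "price", and "category" are invented
# here purely for illustration.
document = {
    "reference": "https://example.com/catalog/item-1042",
    "content": "A lightweight two-person tent with a waterproof flysheet.",
    "title": "Trail Tent 2",
    "price": 129.99,
    "category": "camping",
}

# Each top-level attribute of the JSON object corresponds to one field.
field_names = sorted(document)

# The serialized form is what a JSON-based document store would hold.
serialized = json.dumps(document, indent=2)
print(serialized)
```

Whatever the source was (an email, a web page, a catalog entry), the result has the same shape: a flat set of named fields.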

See Also: Create Documents

Fields

The fields contain the information and content in your document. The fields can contain anything from a single character to the main text of your documents.

The most important fields are:

  • The reference. The reference can be any value, but it is usually a handle to allow you to identify the original document easily; for example, the original URL or file path.
  • The content. The content is the bulk of your document. For example, it might be the body of an email, the main description of a product, or the visible text in a PDF.

You can have as many different fields as you like, but Haven OnDemand gives some of them special treatment, according to the field type.

The field type determines how Haven OnDemand processes the content, and what you can do with it. For example, the content field is an index field. Index fields have special text processing, where Haven OnDemand stores information about the terms in the field so that you can easily perform a search for a keyword or phrase. Other fields are set up to allow parametric (faceted) search, or to optimize searching for numeric or date values.

If a field does not have a special field type, it is a store only field. Haven OnDemand keeps store only field content, but does not apply any additional processing. This field type is useful for content that you want to have available with a document, but which is not useful for search. For example, some document metadata (such as the content length) and related file information (such as an image URL) are kept in store only fields.
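
One way to picture the relationship between fields and field types is as a lookup table with a store only default. The sketch below is not the Haven OnDemand configuration format; the field names, the type labels, and the field_type helper are assumptions made for illustration.

```python
# Hypothetical mapping of field names to field types. The type labels
# mirror the kinds of field described above; the field names are invented.
FIELD_TYPES = {
    "content": "index",        # tokenized for keyword and phrase search
    "category": "parametric",  # used for parametric (faceted) search
    "price": "numeric",        # optimized for numeric values
    "published": "date",       # optimized for date values
}


def field_type(name):
    """Return the configured type, or 'store_only' if none is set."""
    return FIELD_TYPES.get(name, "store_only")


print(field_type("content"))    # a field with a special type
print(field_type("image_url"))  # no special type, so store only
```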

For more information about the available field types, see Index Field Types.

Text Index Flavors

Every text index has a flavor, which you specify when you create it. The flavor of a text index determines a few different properties:

  • Index size. The Explorer flavor has only a small amount of storage, intended for small test setups. The Standard flavor is larger.
  • Field configuration. Different flavors have different standard field configurations, which determine what you can index. In addition, some flavors allow you to configure custom fields with a particular field type.
  • Function. The Explorer and Standard flavors are intended for normal document storage and retrieval, while the Categorization and Query Manipulation flavors are set up for use with specific API sets.

For more information about the available flavors, see Index Flavors.

The Indexing Process

During the indexing process, Haven OnDemand extracts the text from the index document and processes the fields.

For fields with the index field type, it tokenizes the text into terms, removes stop words (words that are too common to add meaning), and processes information about the terms. For example, it stores the stem (the linguistic root of the word), and information about how many times a particular term occurs in a document. For more information, see Text Tokenization and Processing in Text Indexes.
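
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the Haven OnDemand implementation: the stop word list is a tiny sample, and naive_stem is a crude suffix stripper standing in for real linguistic stemming.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def naive_stem(term):
    """Strip a few common English suffixes (a stand-in for real stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def index_terms(text):
    """Tokenize, drop stop words, stem, and count occurrences per stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(naive_stem(t) for t in kept)


counts = index_terms("The tents and the tent were packed in the bags")
print(counts)
```

In this example "tents" and "tent" reduce to the same stem, so a search for either form can match the document, and the stored count records that the term occurs twice.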

For fields with special field types, it processes the field appropriately. For example, for numeric type fields it optimizes retrieval by numeric range, and for parametric type fields, it optimizes exact string matching and retrieval.
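
To show why these field types help retrieval, the following Python sketch builds two simple lookup structures: a sorted list for range queries on a numeric field, and a value-to-documents map for exact matching on a parametric field. The documents, field names, and helper functions are hypothetical, and the data structures are one plausible approach rather than a description of Haven OnDemand internals.

```python
import bisect
from collections import defaultdict

# Hypothetical documents with one numeric and one parametric field.
docs = [
    {"reference": "doc-1", "price": 20.0, "category": "camping"},
    {"reference": "doc-2", "price": 75.0, "category": "camping"},
    {"reference": "doc-3", "price": 45.0, "category": "climbing"},
]

# Numeric field: keep (value, reference) pairs sorted by value so that a
# range query becomes two binary searches instead of a full scan.
price_index = sorted((d["price"], d["reference"]) for d in docs)
price_keys = [price for price, _ in price_index]


def in_price_range(low, high):
    """Return references whose price is in the inclusive range [low, high]."""
    start = bisect.bisect_left(price_keys, low)
    end = bisect.bisect_right(price_keys, high)
    return [ref for _, ref in price_index[start:end]]


# Parametric field: map each exact string value to the matching references.
category_index = defaultdict(list)
for d in docs:
    category_index[d["category"]].append(d["reference"])


def with_category(value):
    """Return references whose category equals value exactly."""
    return category_index.get(value, [])


print(in_price_range(10, 50))
print(with_category("camping"))
```

Note that the parametric lookup in this sketch is an exact, case-sensitive string match, unlike the tokenized and stemmed matching applied to index fields.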