Text Tokenization and Processing in Text Indexes
Details of linguistic and other processing performed on unstructured text.

The Haven OnDemand text indexes are stores for documents, designed for easy retrieval.

When you add a file to a text index, Haven OnDemand processes the document content. It processes the document fields, for example to identify content that is a title or a date. For content that occurs in index type fields, it performs additional processing to extract the terms in the document and store them for retrieval.

This page explains some of the processing that is performed during indexing, and some additional configuration that affects how you can search for items in Haven OnDemand text indexes. The sections below apply to content in index fields in all flavors of text index. See Index Flavors for details of the index fields configured for each flavor.


Stemming

Stemming is the process of reducing a word to its linguistic root. The purpose of this reduction is to find a base term, so that a search can be expanded to include all forms of the term. For example, you generally want a search for the word elections to match a document that contains the word election. As long as the two terms stem to the same form, the search returns documents that contain either term.

During indexing, Haven OnDemand stems each term, and stores the stem as well as the unstemmed term. During querying, Haven OnDemand stems the query term, and matches it against the stored stems in the index.
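As a rough illustration of the index-time and query-time stemming described above, the following Python sketch uses a deliberately simple plural-stripping rule. The function names (toy_stem, build_index, search) and the stemming rule itself are hypothetical; the actual Haven OnDemand stemming algorithm is not specified on this page and is likely to be more sophisticated.

```python
from collections import defaultdict


def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: lowercase and strip a plural 's'."""
    term = term.lower()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def build_index(docs):
    """Map each stem to the documents and the unstemmed terms that produced it."""
    index = defaultdict(lambda: {"doc_ids": set(), "terms": set()})
    for doc_id, text in docs.items():
        for term in text.split():
            entry = index[toy_stem(term)]
            entry["doc_ids"].add(doc_id)
            entry["terms"].add(term)  # keep the unstemmed form as well as the stem
    return index


def search(index, query_term):
    """Stem the query term and match it against the stored stems."""
    entry = index.get(toy_stem(query_term))
    return entry["doc_ids"] if entry else set()


docs = {
    "doc1": "national election results",
    "doc2": "local elections announced",
}
index = build_index(docs)
print(search(index, "elections"))  # both documents match, because both terms share a stem
```

In this sketch, a query for either election or elections matches both documents, because the comparison is made on stems rather than on the original terms.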

Haven OnDemand uses a stemming algorithm that applies the same stemming rules to terms in all languages, both for documents in the index and for the query text. This approach means that a search for a particular term can return relevant documents in any language. However, it is not a form of translation, and only terms that are common to multiple languages tend to stem the same way. A search for a word returns documents in a different language only when they contain a word that has the same stem.


Special Characters

In Haven OnDemand text indexes, there are three types of characters:

Text characters: Letters and numbers, including logograph characters from Asian writing systems.
Separator characters: Characters that separate two words, such as spaces, tabs, and line breaks.
Non-separator characters: Other characters, such as punctuation.

Text characters include the Roman, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Thai, and Korean alphabets and standard character sets.

In the index, separator characters are converted to spaces, which mark a break between terms. Non-separator characters are deleted. If parts of text are separated only by non-separator characters, they become a single term.

Separator characters include spaces, tabs, and line breaks, as well as other special characters such as the @ symbol. Most other punctuation characters are non-separators.

The period (.), hyphen (-), and apostrophe (') characters are special separator characters. If parts of text are separated by these characters, the text on either side of the character is indexed as separate words, and the whole word (including the separator) is also indexed.

Example: Email Addresses

If your document contains the following email address:

joe.smith@example.com

Haven OnDemand indexes this as several terms: joe.smith, joe, smith, example.com, example, com

This means that you can search for joe.smith or Joe Smith and retrieve the document that contains this email address.
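The character rules in this section can be sketched in Python as follows. This is an approximation for illustration only, not the Haven OnDemand tokenizer: the tokenize function is hypothetical, only whitespace and the @ symbol are treated as separators, the & character is assumed to be a non-separator, and stripping leading or trailing special separators is an added assumption.

```python
import re

# The period, hyphen, and apostrophe: index the whole word and each part.
SPECIAL_SEPARATORS = ".-'"
# Ordinary separators (assumption: only whitespace and @ are handled here).
SEPARATOR_PATTERN = re.compile(r"[\s@]+")
SPECIAL_SPLIT_PATTERN = re.compile(r"[.\-']")


def tokenize(text):
    """Return the terms that would be indexed under the simplified rules above."""
    terms = []
    for chunk in SEPARATOR_PATTERN.split(text):
        # Delete non-separator characters (anything that is not a text
        # character or a special separator).
        chunk = "".join(
            ch for ch in chunk if ch.isalnum() or ch in SPECIAL_SEPARATORS
        )
        # Assumption: special separators at the edges of a word are dropped.
        chunk = chunk.strip(SPECIAL_SEPARATORS)
        if not chunk:
            continue
        terms.append(chunk)  # the whole word, including special separators
        parts = [p for p in SPECIAL_SPLIT_PATTERN.split(chunk) if p]
        if len(parts) > 1:
            terms.extend(parts)  # the text on either side of each special separator
    return terms


print(tokenize("joe.smith@example.com"))
# ['joe.smith', 'joe', 'smith', 'example.com', 'example', 'com']
```

Under the same assumptions, tokenize("rock&roll") yields the single term rockroll, because the non-separator character is deleted without creating a break between terms.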


Stop Lists

In most languages, the 100 most common words (for example, the, a, of in English) make up around 50% of all text. These words do not usually provide much information, and are not useful for searching content.

Haven OnDemand has a stop list, which contains many of these common words. Haven OnDemand ignores these words when it indexes a document, which reduces the size of the index and can improve query performance. It also removes stop words from query text before searching for matching documents.
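The following Python sketch illustrates the idea of applying the same stop list at index time and at query time. The three-word stop list is taken from the English examples above and is far smaller than the real list; the remove_stop_words function is hypothetical.

```python
# Tiny illustrative stop list; the real Haven OnDemand list is much longer.
STOP_WORDS = {"the", "a", "of"}


def remove_stop_words(text):
    """Split text on whitespace and drop any term that is in the stop list."""
    return [term for term in text.lower().split() if term not in STOP_WORDS]


# The same filter is applied to document content and to query text,
# so stop words never need to be stored or matched.
indexed_terms = remove_stop_words("The history of a city")
query_terms = remove_stop_words("history of the city")

print(indexed_terms)  # ['history', 'city']
print(query_terms)    # ['history', 'city']
```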

For most text index flavors, Haven OnDemand uses an international stop list. This stop list contains common stop words for all languages, but does not include terms that are stop words in one language but useful terms in another language.

Note: The Categorization text index flavor does not use a stop list.

For a full list of the terms included in the Haven OnDemand stop list, see Stop List for Text Indexes.

Query Terms

When you send a query to Haven OnDemand, using the Query Text Index API, it uses a maximum of 35 query terms to match documents. If your query text contains more than 35 terms, Haven OnDemand discards the additional terms.

In the same way, the Find Similar API uses the first 35 unique terms from your input document or text to find and suggest other similar documents.
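The two limits above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the terms kept for a query are the first 35 in order; this page states only that the additional terms are discarded.

```python
MAX_QUERY_TERMS = 35


def limit_query_terms(terms, limit=MAX_QUERY_TERMS):
    """Keep at most `limit` terms (assumption: the earliest ones are kept)."""
    return terms[:limit]


def first_unique_terms(terms, limit=MAX_QUERY_TERMS):
    """Keep the first `limit` distinct terms, preserving their original order."""
    seen = set()
    unique = []
    for term in terms:
        if term in seen:
            continue
        seen.add(term)
        unique.append(term)
        if len(unique) == limit:
            break
    return unique


long_query = [f"term{i}" for i in range(40)]
print(len(limit_query_terms(long_query)))  # 35

repeated = ["solar", "energy", "solar", "panel"]
print(first_unique_terms(repeated, limit=2))  # ['solar', 'energy']
```

A practical consequence is that the most important terms in a long query or input document should appear early, if the first-in-order assumption holds.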

If you use a wildcard in a query, the expanded wildcard values do not count towards the number of query terms. Haven OnDemand expands any wildcard to a maximum of 10,000 terms. However, Haven OnDemand recommends that you avoid using wildcard queries that expand to a large number of values, because they can reduce query performance.

The Haven OnDemand text indexes also have a maximum term length. For terms longer than 20 characters, Haven OnDemand indexes a truncated version. You can still retrieve the term in a query, because Haven OnDemand truncates the query text term in the same way. However, you must consider this length limit if you have a large number of terms with a common prefix, because terms that differ only after the truncation point are indexed as the same term.
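To show why the length limit matters for terms with a common prefix, here is a small Python sketch. It assumes that truncation keeps the first 20 characters of a term, which this page implies but does not state exactly; the truncate_term function is hypothetical.

```python
MAX_TERM_LENGTH = 20


def truncate_term(term, max_length=MAX_TERM_LENGTH):
    """Assumed behavior: keep only the first `max_length` characters."""
    return term[:max_length]


# Two different long terms that share their first 20 characters.
first = "electroencephalographically"
second = "electroencephalographers"

# Both terms are stored under the same truncated form, so a query for
# one of them (truncated the same way) also matches the other.
print(truncate_term(first) == truncate_term(second))  # True

# Short terms are unaffected.
print(truncate_term("election"))  # election
```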


Query Result Weighting and Relevance

By default, Query Text Index results are returned in order of relevance. To calculate the relevance, Haven OnDemand takes into account the APCM weight (statistical weight) assigned to each of the matched terms, and the number of times those terms occur in the result document.

Matching more of the terms generally results in a higher ranking than matching a few of the terms multiple times. Additional factors such as capitalization, stemming, and proximity also adjust the weighting, although they normally have less effect than matching an additional term.

Haven OnDemand also boosts the relevance of a document if a query term occurs in the document title, rather than in the body of the document.
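The ranking behavior described in this section can be illustrated with a toy scoring function. This is not the Haven OnDemand relevance formula, which is not given on this page. The toy_relevance function, the logarithmic damping of repeated occurrences, and the title_boost value are all assumptions chosen only to reproduce the qualitative behavior: each additional matched term counts for more than repeated matches of the same term, and title matches are boosted.

```python
import math


def toy_relevance(query_terms, title_terms, body_terms, weights, title_boost=1.5):
    """Illustrative score: per-term weight, damped by occurrence count, boosted for title hits."""
    score = 0.0
    for term in set(query_terms):
        count = title_terms.count(term) + body_terms.count(term)
        if count == 0:
            continue  # unmatched terms contribute nothing
        # Repeated occurrences add only a small, logarithmic bonus.
        term_score = weights.get(term, 1.0) * (1.0 + 0.1 * math.log(count))
        if term in title_terms:
            term_score *= title_boost
        score += term_score
    return score


query = ["solar", "energy"]
weights = {"solar": 1.0, "energy": 1.0}  # stand-ins for per-term statistical weights

both_terms = toy_relevance(query, [], ["solar", "energy"], weights)
one_term_repeated = toy_relevance(query, [], ["solar"] * 5, weights)
title_match = toy_relevance(query, ["solar"], ["energy"], weights)

print(both_terms > one_term_repeated)  # True: matching more terms ranks higher
print(title_match > both_terms)        # True: a title match is boosted
```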