Find Similar

Finds documents that are conceptually similar to your text or a document.

The Find Similar API returns documents in the HPE Haven OnDemand databases that are similar to text or a document that you provide. The API extracts the "best" terms from your input and calculates a statistical weight for each term according to its importance in the document. It then uses these terms to query the index for similar documents, and ranks the results by the number of matching terms, adjusted by the term weights.

Quick Start

You can submit either text, an index reference, a file, an object store reference, or a URL.

/1/api/[async|sync]/findsimilar/v1?index_reference= or text= or file= or reference= or url=

  • index_reference. You provide the reference for a document in an HPE Haven OnDemand index, and the API returns details of similar documents. You can find document references in the result from the Query Text Index API.

  • text. You provide some plain text, and the API returns documents that are similar to the best terms in your text.

  • file. You provide a file, and the API extracts the text from the file, and then treats the content in the same way as text.

  • reference. You provide the reference to a document in the Haven OnDemand object store, and the API returns details of similar documents from the Haven OnDemand index.

  • url. You provide a publicly accessible HTTP URL. The API retrieves the document from the specified URL and returns details of similar documents.
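As a minimal sketch of the input options above, the following Python builds a GET URL for a synchronous request. The helper name `build_find_similar_url` and the `YOUR_API_KEY` placeholder are illustrative only, and the check that exactly one input is supplied is an assumption drawn from the description above rather than documented server behavior. The `file` input is left out because it requires a multipart POST.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.havenondemand.com/1/api/sync/findsimilar/v1"

# Input parameters that can travel in a query string.
# (file is excluded: it needs a multipart POST.)
QUERY_INPUTS = ("index_reference", "text", "reference", "url")


def build_find_similar_url(api_key, **params):
    """Return a GET URL for a synchronous Find Similar request (hypothetical helper)."""
    supplied = [name for name in QUERY_INPUTS if name in params]
    if len(supplied) != 1:
        # Assumption: one input per request, as the list of options suggests.
        raise ValueError("supply exactly one of: " + ", ".join(QUERY_INPUTS))
    query = dict(params, apikey=api_key)
    # urlencode percent-encodes values and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(query)


request_url = build_find_similar_url(
    "YOUR_API_KEY",  # placeholder, not a real key
    text="language acquisition in computers",
    indexes="wiki_eng",
)
print(request_url)
```

The function only builds the URL; sending it with an HTTP client of your choice is left to the caller.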

Note: API input is subject to a maximum size quota. If you upload text or a file that is too large, the API returns an error. For more information, see Rate Limiting, Quotas, Data Expiry, and Maximums.

Note: For more on the conceptual underpinnings of this and other types of text querying, see the Conceptual Search section in Use Haven OnDemand Search Functionality.

The Find Similar API has many of the same options as the Query Text Index API. The optional parameters allow you to return a more specific results set, or to organize your results:

  • The print, print_fields, and sort parameters allow you to specify what data you want to return.
    • The print and print_fields parameters determine which document fields you want to print in the response output.
    • The sort parameter determines the order in which the results are returned.
  • The highlight parameter lets you mark up the result document text with HTML tags that highlight the query terms or sentences that contain the query terms.

  • The summary parameter includes a summary of the response items in the output. The summary can be of three kinds:
    • concept: a collection of sentences from the response item which match the concepts in the input.
    • context: a collection of sentences from the response item which match the search terms in the input.
    • quick: the first few sentences of the response item.

Haven OnDemand provides a number of Public Text Indexes that you can query, including Wikipedia and news sources in various languages, and datasets of patents and scientific articles. You can also create your own text indexes.

Example

In this example, the input text (abridged in the illustration) is from the abstract of a scientific research project. The API searches the English Wikipedia for similar content, and provides a quick summary.

/1/api/sync/findsimilar/v1?text=This+project+explores+the+nature+of+language+acquisition+in+computers%2C+guided+by+techniques+similar+to+those+used+in+children....&indexes=wiki_eng&sort=relevance&summary=quick

The results are sorted in order of relevance, based on the weight property. The links property of each result lists the stemmed query terms that match in that result. Here is the first result:


{
  "documents": [
    {
      "reference": "http://en.wikipedia.org/wiki/Outline of natural language processing",
      "weight": 86.53,
      "links": [
        "BIGR",
        "RECURS",
        "SYNTAX",
        "ALYZED",
        "MORPHOLOG",
        "SENTEN",
        "VERIF",
        "UNDERSTAND",
        "ANALY",
        "INPUT",
        "JAV",
        "SUGGEST",
        "FREQUEN",
        "ACQUISIT",
        "MINIM",
        "IMPLEMENT",
        "PROCESS",
        "METHOD",
        "TEXT",
        "DIFFER",
        "PATT",
        "TECHNI",
        "DISTINGUISH",
        "AIM",
        "POTENT",
        "DEFIN",
        "COMPUT",
        "SPANISH",
        "DETERMIN"
      ],
      "index": "wiki_eng",
      "title": "Outline of natural language processing",
      "summary": "<span style='background-color:green; color:white'>The following outline is provided as an overview of and topical guide to natural language processing: Natural language processing – computer activity in which computers are entailed to analyze, understand, alter, or generate natural language.</span> This",
      "wikipedia_category": [
        "Natural language processing",
        "Outlines"
      ]
    }
  ]
}
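
A response like the one above can be read with any JSON parser. The sketch below, which is one possible approach rather than an official client, parses a trimmed copy of the example response and lists each document with its weight and the number of matching terms. The `summarize` function is a hypothetical helper, and most of the `links` entries are omitted to keep the sample short.

```python
import json

# A trimmed copy of the example response (most "links" omitted for brevity).
response_body = """
{
  "documents": [
    {
      "reference": "http://en.wikipedia.org/wiki/Outline of natural language processing",
      "weight": 86.53,
      "links": ["SYNTAX", "MORPHOLOG", "ACQUISIT"],
      "index": "wiki_eng",
      "title": "Outline of natural language processing"
    }
  ]
}
"""


def summarize(body):
    """Return (title, weight, matched term count) per document, highest weight first."""
    documents = json.loads(body).get("documents", [])
    ranked = sorted(documents, key=lambda d: d.get("weight", 0.0), reverse=True)
    return [
        (
            d.get("title", d.get("reference", "")),  # fall back to the reference
            d.get("weight", 0.0),
            len(d.get("links", [])),
        )
        for d in ranked
    ]


rows = summarize(response_body)
for title, weight, term_count in rows:
    print(f"{weight:6.2f}  {title}  ({term_count} matching terms)")
```

Every document property is read with `get` because the response model marks them all as optional.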
Synchronous
https://api.havenondemand.com/1/api/sync/findsimilar/v1
Asynchronous
https://api.havenondemand.com/1/api/async/findsimilar/v1
Authentication

This API requires an authentication token to be supplied in the following parameter:

Parameter Description
apikey The API key to use to authenticate the API request.
Parameters

This API accepts the following parameters:

Required
Name Type Description
file
binary A file containing the query text. Multipart POST only.
index_reference
string The reference to a document in the Haven OnDemand database.
reference
string A Haven OnDemand reference obtained from either the Expand Container or Store Object API. The corresponding query text is passed to the API.
text
string The query text.
url
string A publicly accessible HTTP URL from which the query text can be retrieved.
Optional
Name Type Description
absolute_max_results
number The absolute maximum number of results to return for this query. Default value: 6.
end_tag
string The closing HTML tag to use to highlight a match. If omitted, this is generated automatically from the start_tag.
field_text
string The fields that result documents must contain, and the conditions that these fields must meet for the documents to return as results. See Field Text Operators.
highlight
enum The highlighting option to use for the result text. Default value: off.
indexes
array<resource> The name of the Haven OnDemand text index that you want to search for results. You can use the public datasets, or your own text indexes. See Public Text Indexes. Default value: [wiki_eng].
max_date
string The latest creation date or time that a document can have to return as a result. See Parameter Date Formats.
max_page_results
number The maximum number of results to return for this query from the absolute number of results returned. If you have set the start parameter, max_page_results sets the maximum number of results to return from the total results set.
min_date
string The earliest creation date or time that a document can have to return as a result. See Parameter Date Formats.
min_score
number The minimum percentage relevance that results must have to the query to return. Default value: 0.
print
enum The types of fields and content to display in the results. Default value: fields.
print_fields
string The names of fields to print in the results.
query_profile
resource The name of the query profile that you want to apply.
sort
enum The criteria to use for the result display order. By default, results are displayed in order of relevance. Default value: relevance.
start
number The number of the first document to display. Default value: 1.
start_tag
string The opening HTML tag to use to highlight a match. Default value: <span style="background-color: yellow">.
summary
enum The type of summary to create for result documents. Default value: off.
total_results
boolean Set to true to return extra information about the total number of result documents, and the total number of documents and document sections in the query databases. Default value: false.
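
The highlight, start_tag, and end_tag parameters above work together. The sketch below assembles them into a parameter dictionary and shows what a closing tag derived from an opening tag looks like. The helpers `closing_tag_for` and `highlight_params` are hypothetical, and the derivation is only an illustration: the service generates end_tag itself when it is omitted, and its actual logic is not described here.

```python
import re

# Default opening tag, as listed for the start_tag parameter.
DEFAULT_START_TAG = '<span style="background-color: yellow">'


def closing_tag_for(start_tag):
    """Illustrate the closing tag that corresponds to an opening HTML tag."""
    match = re.match(r"\s*<\s*([A-Za-z][A-Za-z0-9]*)", start_tag)
    if match is None:
        raise ValueError(f"not an opening HTML tag: {start_tag!r}")
    return f"</{match.group(1)}>"


def highlight_params(start_tag=DEFAULT_START_TAG, end_tag=None, mode="terms"):
    """Build the highlight-related query parameters as a dict."""
    params = {"highlight": mode, "start_tag": start_tag}
    if end_tag is not None:
        # Only send end_tag when set; otherwise the API derives it from start_tag.
        params["end_tag"] = end_tag
    return params


params = highlight_params(start_tag="<mark>")
print(params, closing_tag_for(params["start_tag"]))
```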
Enumeration Types

This API's parameters use the enumerations described below:

highlight
The highlighting option to use for the result text.
off No highlighting.
terms Terms that match the query text.
sentences Sentences that contain query terms.
print
The types of fields and content to display in the results.
all All fields.
all_sections All fields and all sections.
date Date fields.
fields Print the fields listed in the print_fields parameter.
none Do not print content fields.
no_results Do not print results.
parametric Parametric fields.
reference Reference fields.
sort
The criteria to use for the result display order. By default, results are displayed in order of relevance.
relevance Relevance order (most relevant first).
reverse_relevance Relevance order (least relevant first).
date Date order (most recent first).
reverse_date Date order (oldest first).
autn_rank Order by the standard relevance adjustment field.
off No sorting.
summary
The type of summary to create for result documents.
concept Concept Summary
Contains sentences that are typical of the result content. These sentences can be from different parts of the result document.
context Context Summary
Contains sentences that are typical of the result content, biased by the terms in the query text.
quick Quick Summary
The first few sentences of the document.
off No summary
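
Because these four parameters accept only the values listed above, a client can catch typos before sending a request. The following is a minimal sketch of such a check. The `ENUMS` table copies the values from this section, while `check_enums` is a hypothetical helper and not part of any Haven OnDemand library.

```python
# Allowed values, copied from the enumeration tables in this section.
ENUMS = {
    "highlight": {"off", "terms", "sentences"},
    "print": {
        "all", "all_sections", "date", "fields",
        "none", "no_results", "parametric", "reference",
    },
    "sort": {
        "relevance", "reverse_relevance", "date",
        "reverse_date", "autn_rank", "off",
    },
    "summary": {"concept", "context", "quick", "off"},
}


def check_enums(params):
    """Return one error message per enum parameter that has an unknown value."""
    errors = []
    for name, allowed in ENUMS.items():
        value = params.get(name)
        if value is not None and value not in allowed:
            errors.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return errors


# "brief" is not a valid summary type, so one error is reported.
errors = check_enums({"sort": "relevance", "summary": "brief"})
for message in errors:
    print(message)
```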

This API returns a JSON response that is described by the model below. This single model is presented both as an easy-to-read abstract definition and as the formal JSON schema.

Asynchronous Use

Additional requests are required to get the result if this API is invoked asynchronously.

You can use /1/job/status/<job-id> to get the status of the job, including results if the job is finished.

You can also use /1/job/result/<job-id>, which waits until the job has finished and then returns the result.
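
The polling pattern described above could be sketched as follows. Several things here are assumptions: that the job endpoints live on the same host as the API URLs, the helper names, and in particular the shape of the status payload. The `"status"` key used in the demonstration is invented so that the example can run offline, and the `fetch_json` callable stands in for whatever HTTP client you use (it would also need to pass your apikey).

```python
import time

# Assumption: job endpoints share the host used by the API URLs.
API_ROOT = "https://api.havenondemand.com"


def job_status_url(job_id):
    """URL that reports the job status, with results once finished."""
    return f"{API_ROOT}/1/job/status/{job_id}"


def job_result_url(job_id):
    """URL that waits for the job to finish and then returns the result."""
    return f"{API_ROOT}/1/job/result/{job_id}"


def wait_for_job(job_id, fetch_json, is_finished, interval=1.0, max_polls=30):
    """Poll the status URL until is_finished(payload) is true, then return it."""
    for _ in range(max_polls):
        payload = fetch_json(job_status_url(job_id))
        if is_finished(payload):
            return payload
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Offline demonstration with canned replies; the payload shape is hypothetical.
fake_replies = iter([{"status": "queued"}, {"status": "finished"}])
final = wait_for_job(
    "example-job-id",
    fetch_json=lambda url: next(fake_replies),
    is_finished=lambda payload: payload.get("status") == "finished",
    interval=0,
)
print(final)
```

If you do not need progress updates, a single request to the result URL avoids the polling loop, since that endpoint waits for the job to finish.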

Model
This is an abstract definition of the response that describes each of the properties that might be returned.
Find Similar Response {
documents ( array[Documents] ) The details of the returned documents.
warnings ( array[Warnings] , optional)
}
Find Similar Response:Documents {
index ( string , optional) The database that the result returned from.
links ( array[string] , optional) The terms from the query that match in the results document.
reference ( string , optional) The reference string that identifies the result document.
summary ( string , optional) The summary of the results document.
title ( string , optional) The title of the result document.
weight ( number , optional) The percentage relevance that the result document has to the original query.
}
Find Similar Response:Warnings {
code ( integer , optional)
details ( object , optional)
}
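
One way to work with this model in client code is to map each document onto a small typed structure. The sketch below does this with a Python dataclass. The class and function names are illustrative, and the `extra` field is an addition of this example, used to hold printed fields that fall outside the model (such as wikipedia_category in the example response).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Property names taken from the Documents model above.
KNOWN_FIELDS = {"index", "links", "reference", "summary", "title", "weight"}


@dataclass
class Document:
    """One entry of the documents array; every property is optional."""
    index: Optional[str] = None
    links: List[str] = field(default_factory=list)
    reference: Optional[str] = None
    summary: Optional[str] = None
    title: Optional[str] = None
    weight: Optional[float] = None
    extra: dict = field(default_factory=dict)  # fields not named in the model


def parse_response(payload):
    """Convert a decoded JSON response into a list of Document objects."""
    if "documents" not in payload:
        # documents is the only required property of the response.
        raise ValueError("response has no 'documents' property")
    parsed = []
    for item in payload["documents"]:
        known = {k: v for k, v in item.items() if k in KNOWN_FIELDS}
        extra = {k: v for k, v in item.items() if k not in KNOWN_FIELDS}
        parsed.append(Document(**known, extra=extra))
    return parsed


docs = parse_response(
    {"documents": [{"title": "T", "weight": 1.5, "wikipedia_category": ["X"]}]}
)
print(docs[0])
```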
Model Schema
This is a JSON schema that describes the syntax of the response. See json-schema.org for a complete reference.
{
    "properties": {
        "documents": {
            "items": {
                "properties": {
                    "index": {
                        "type": "string"
                    },
                    "links": {
                        "items": {
                            "type": "string"
                        },
                        "type": "array"
                    },
                    "reference": {
                        "type": "string"
                    },
                    "summary": {
                        "type": "string"
                    },
                    "title": {
                        "type": "string"
                    },
                    "weight": {
                        "type": "number"
                    }
                },
                "type": "object"
            },
            "type": "array"
        },
        "warnings": {
            "type": "array",
            "items": {
                "type": "object",
                "additionalProperties": false,
                "properties": {
                    "code": {
                        "type": "integer"
                    },
                    "details": {
                        "type": "object"
                    }
                }
            }
        }
    },
    "required": [
        "documents"
    ],
    "type": "object",
    "additionalProperties": false
}
To try this API with your own data and use it in your own applications, you need an API Key. You can create an API Key from your account page - API Keys.