Haven OnDemand Web Cloud Connector Quick Start
Index websites with the Web Cloud Connector

Use the Web Cloud Connector

Connectors allow you to ingest documents into Haven OnDemand text indexes from various sources.

The Web Cloud Connector lets you index a remote website without running your own crawler; Haven OnDemand does the crawling and indexing for you.

Create the Connector

Start by calling the Create Connector API, which creates a connector of the type you choose.

The required parameters are:

  • connector : The name of your connector. You use this name for further updates, status checks, and to delete the Connector.

  • flavor : The type of connector you want to use. For a Web Cloud Connector, this value is web_cloud.

  • config : This JSON parameter contains configuration relevant to this type of connector.

  • destination : This JSON parameter specifies what to do with the output from the site you connect to the Connector. Currently, this specifies the text index that you want to index into.

Example of a Web Cloud Connector Configuration

/createconnector/v1?connector=havenondemand&flavor=web_cloud
config={ "url" : "http://www.hodsite.com" }
destination={ "action" : "addtotextindex", "index" : "hodsite" }

This configuration creates a Web Cloud Connector to follow links from www.hodsite.com, and index every page it finds into the hodsite text index.
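As a rough sketch, the request above could be assembled in Python as follows. The base URL is the one that appears in the Update Connector example in this guide; the build_create_request helper is hypothetical, and authentication parameters are left out because this guide does not show them. The sketch only builds the URL and does not send it.

```python
import json
from urllib.parse import urlencode

# Base URL as it appears in the Update Connector example in this guide.
API_BASE = "https://api.havenondemand.com/1/api/sync"


def build_create_request(connector, url, index):
    """Return a Create Connector URL for a Web Cloud Connector (hypothetical helper)."""
    params = {
        "connector": connector,
        "flavor": "web_cloud",
        # config and destination are JSON objects passed as string parameters.
        "config": json.dumps({"url": url}),
        "destination": json.dumps({"action": "addtotextindex", "index": index}),
    }
    return f"{API_BASE}/createconnector/v1?{urlencode(params)}"


request_url = build_create_request("havenondemand", "http://www.hodsite.com", "hodsite")
print(request_url)
```

Serializing config and destination with json.dumps and then URL-encoding them avoids hand-escaping the braces and quotes in the query string.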

Operate the Connector

Start and Stop a Connector Manually

The Start Connector API takes a connector name, and starts a crawl.

/startconnector/v1?connector=havenondemand

The Stop Connector API can stop a run.

Check Status and History

You can use the Connector Status API to check whether the connector is running.

/connectorstatus/v1?connector=havenondemand
{
  "connector": "havenondemand",
  "status": "PROCESSING",
  "token": "MTI3LjAuMS4xOjcyMDA6RkVUQ0g6LTEzMjcwNTE0NTU=",
  "queued_time": "24/04/2015 17:11:05 +00:00",
  "time_in_queue": 0,
  "process_start_time": "24/04/2015 17:11:05 +00:00",
  "time_processing": 0,
  "document_counts": {}
}
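The following minimal sketch shows one way to read such a response in Python. The summarize_status helper is hypothetical, and the timestamp format string is inferred from the sample response above rather than taken from a specification.

```python
import json
from datetime import datetime

# Timestamp layout inferred from the sample: day/month/year, time, UTC offset.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S %z"

sample_response = """
{
  "connector": "havenondemand",
  "status": "PROCESSING",
  "queued_time": "24/04/2015 17:11:05 +00:00",
  "time_in_queue": 0,
  "process_start_time": "24/04/2015 17:11:05 +00:00",
  "time_processing": 0,
  "document_counts": {}
}
"""


def summarize_status(raw):
    """Extract the fields most useful for monitoring (hypothetical helper)."""
    body = json.loads(raw)
    return {
        "connector": body["connector"],
        "status": body["status"],
        "started": datetime.strptime(body["process_start_time"], TIMESTAMP_FORMAT),
    }


summary = summarize_status(sample_response)
print(summary["status"], summary["started"].isoformat())
```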

You can use the Connector History API to get a history of all activity on the Connector, with a range of convenient filters that you can set.

{
  "history": [
    {
      "connector": "havenondemand",
      "status": "FINISHED",
      "document_counts": {
        "errors": 0
      },
      "queued_time": "08/08/2016 05:22:26 +00:00",
      "time_in_queue": 0,
      "process_start_time": "08/08/2016 05:22:26 +00:00",
      "time_processing": 2,
      "process_end_time": "08/08/2016 05:22:28 +00:00",
      "start_time": "08/08/2016 05:22:26 +00:00",
      "token": "MTAuOC4xNi4xMDo3MjEwOkZFVENIOi0xODE5MDgwMjE1"
    },
    {
      "connector": "havenondemand",
      "status": "FINISHED",
      "document_counts": {
        "added": 2,
        "errors": 0,
        "ingest_added": 1,
        "ingest_failed": 1
      },
      "queued_time": "07/08/2016 23:21:50 +00:00",
      "time_in_queue": 0,
      "process_start_time": "07/08/2016 23:21:50 +00:00",
      "time_processing": 510,
      "process_end_time": "07/08/2016 23:30:20 +00:00",
      "start_time": "07/08/2016 23:21:50 +00:00",
      "token": "MTAuOC4xNi4xMDo3MjEwOkZFVENIOi04MDY3Njk3ODA="
    }
  ]
}
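As an illustration, the sketch below summarizes the two runs from the sample history. The field values are copied from the sample (tokens and queue fields omitted); the run_seconds helper is hypothetical. In this sample, time_processing equals the difference between process_end_time and process_start_time in seconds, which is an observation from the data rather than a documented guarantee.

```python
from datetime import datetime

# Timestamp layout inferred from the sample responses.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S %z"

# Two runs copied from the sample history response.
history = [
    {
        "status": "FINISHED",
        "document_counts": {"errors": 0},
        "process_start_time": "08/08/2016 05:22:26 +00:00",
        "process_end_time": "08/08/2016 05:22:28 +00:00",
        "time_processing": 2,
    },
    {
        "status": "FINISHED",
        "document_counts": {"added": 2, "errors": 0, "ingest_added": 1, "ingest_failed": 1},
        "process_start_time": "07/08/2016 23:21:50 +00:00",
        "process_end_time": "07/08/2016 23:30:20 +00:00",
        "time_processing": 510,
    },
]


def run_seconds(run):
    """Return the elapsed seconds between the start and end of one run."""
    start = datetime.strptime(run["process_start_time"], TIMESTAMP_FORMAT)
    end = datetime.strptime(run["process_end_time"], TIMESTAMP_FORMAT)
    return int((end - start).total_seconds())


# Counters that are absent from a run (such as "added") are treated as zero.
total_added = sum(run["document_counts"].get("added", 0) for run in history)
for run in history:
    print(run["process_start_time"], run_seconds(run), run["time_processing"])
print("documents added:", total_added)
```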

Start a Connector Run to Reindex a Site from Scratch

By default, the Connector crawls and indexes only new and modified pages. You can use the ignore_previous_state parameter to force the Connector to ignore previous runs and index the site as though it were the first run. For example:

/startconnector/v1?connector=havenondemand&ignore_previous_state=true
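A small sketch of building both kinds of Start Connector request in Python follows. The start_connector_url helper is hypothetical, the base URL is the one shown in the Update Connector example in this guide, and authentication parameters are omitted.

```python
from urllib.parse import urlencode

# Base URL as it appears in the Update Connector example in this guide.
API_BASE = "https://api.havenondemand.com/1/api/sync"


def start_connector_url(connector, ignore_previous_state=False):
    """Return a Start Connector URL, optionally forcing a full reindex."""
    params = {"connector": connector}
    if ignore_previous_state:
        # Only add the flag when a full reindex is wanted; the default run is incremental.
        params["ignore_previous_state"] = "true"
    return f"{API_BASE}/startconnector/v1?{urlencode(params)}"


incremental_url = start_connector_url("havenondemand")
full_reindex_url = start_connector_url("havenondemand", ignore_previous_state=True)
print(incremental_url)
print(full_reindex_url)
```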

Update the Connector Configuration

After you create a Connector, you might want to change the configuration.

The Retrieve Config API takes a Connector name, and returns its configuration.

/retrieveconfig/v1?connector=havenondemand

After you check the configuration, you can change it by using the Update Connector API.

Note: As a config object can be quite large, HPE Haven OnDemand recommends using the POST method for Update Connector requests.

curl -X POST --form "connector=havenondemand" --form "config=NewConfig" https://api.havenondemand.com/1/api/sync/updateconnector/v1

You can also change the destination and the schedule.

/updateconnector/v1?connector=havenondemand&destination=NewDestinationConfig
/updateconnector/v1?connector=havenondemand&schedule=NewScheduleConfig
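The sketch below prepares an Update Connector POST request with the Python standard library, without sending it. The build_update_request helper and the example config values are hypothetical. It sends a URL-encoded form body; whether the API accepts that encoding as well as multipart form data is an assumption.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as shown in the POST example in this guide.
UPDATE_URL = "https://api.havenondemand.com/1/api/sync/updateconnector/v1"


def build_update_request(connector, **fields):
    """Build a POST request that updates config, destination, or schedule."""
    form = {"connector": connector}
    for name, value in fields.items():
        # Each field is a JSON object sent as a string form value.
        form[name] = json.dumps(value)
    # Assumption: a URL-encoded body is accepted in place of multipart form data.
    return Request(UPDATE_URL, data=urlencode(form).encode("utf-8"), method="POST")


request = build_update_request(
    "havenondemand",
    config={"url": "http://www.hodsite.com", "depth": 3},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Putting the configuration in the request body rather than the query string keeps large config objects out of the URL, which is the reason this guide gives for preferring POST.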

If you just want to delete a schedule, you can use the Cancel Connector Schedule API.

/cancelconnectorschedule/v1?connector=havenondemand

Other Configuration Options

Credentials and Credential Policies

Credentials

Credentials enable secure, encrypted connections between the Haven OnDemand APIs and the Web Cloud repositories.

For Web Cloud Connectors accessing web sites that require authentication, you must set up Credentials. In addition to the login and password themselves, you must also specify the names of the page controls that manage the action of logging in. You can find these by inspecting the HTML source code of the page.

In the config field, the parameters you need to set are:

  • form_url_regex: A regular expression that matches the login page to which the connector sends the login parameters, for example .*login.*. This example matches any URL that contains the string "login". The connector attempts to log in on these pages.
  • login_field_value: The name of the ID field on the login page of the Web site.
  • password_field_value: The name of the password field on the login page of the Web site.
  • submit_selector: The CSS2 selector of the form submit button on the Web site login page. In the HTML source code, look for the button that submits the login form and values.

For example, the code below is the config field for a Connector to the page www.hodsite.com. You log in with an email address (a field called email) and a password (a field called pass) and click a button called login.

{
	"url": "http://www.hodsite.com",
	"login_field_value": "email",
	"password_field_value": "pass",
	"form_url_regex":".*login.*",
	"submit_selector":"button#login"
}
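As a quick check before creating the Connector, a sketch like the following can confirm that the four login-related parameters are present and that form_url_regex matches the intended login page. Both helpers and the test URLs are hypothetical, and Python's re module stands in for whatever regular expression engine the connector uses, which this guide does not specify.

```python
import re

login_config = {
    "url": "http://www.hodsite.com",
    "login_field_value": "email",
    "password_field_value": "pass",
    "form_url_regex": ".*login.*",
    "submit_selector": "button#login",
}

REQUIRED_LOGIN_KEYS = (
    "form_url_regex",
    "login_field_value",
    "password_field_value",
    "submit_selector",
)


def missing_login_keys(config):
    """List the login parameters that are absent from the config."""
    return [key for key in REQUIRED_LOGIN_KEYS if key not in config]


def is_login_page(config, page_url):
    # Assumption: the whole URL must match the pattern. For ".*login.*"
    # this is the same as checking that the URL contains "login".
    return re.fullmatch(config["form_url_regex"], page_url) is not None


print(missing_login_keys(login_config))
print(is_login_page(login_config, "http://www.hodsite.com/account/login.html"))
print(is_login_page(login_config, "http://www.hodsite.com/about.html"))
```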

In the credentials field, the parameters you need to set are:

  • login_value: The value of the login, your ID for the site. In the example configuration above, that would be your email address.
  • password_value: The value of your password.

This is an example of a credentials field.

{
	"login_value": "me@myemail.com",
	"password_value": "HereIsMyPwd123"
}

Credentials Policies

Haven OnDemand also lets you set up Credentials Policies to determine how the secured access behaves. If you set up Credentials, you must also set up a Credentials Policy. With Credentials Policies you can, for example:

  • set up email notifications when the Connector decrypts and uses the credentials to log in to the Web site.
  • limit the number of times a given decryption token can be used before it must be updated.
  • configure start and end dates for the validity of the decryption token.

Here is an example of a credentials_policy field:

{
	"notification_email": "my_email",
	"notification_email_frequency": "on_decrypt"
}

For more information, see Connector Credentials and the page for the Web Cloud Connector flavor.

Other Options for Web Cloud Connectors

For Web Cloud configurations, there are some other settings you might want to add.

Use Regular Expressions to Include or Exclude URLs

The url_must_have_regex and url_cant_have_regex parameters allow you to specify regular expressions to define pages that you want to index, and pages that you do not want to index, respectively.
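To try out include and exclude patterns before putting them in a configuration, a sketch like the one below can help. It parses a JSON fragment in which the backslash before the full stop is doubled (as JSON requires), then applies the patterns to a few hypothetical URLs. The is_indexable helper is illustrative only, and Python's re module stands in for the connector's regular expression engine, which this guide does not specify.

```python
import json
import re

# Raw string: this is the JSON text exactly as it would appear in the config.
config_text = r'{"url_must_have_regex": ".*\\.html$", "url_cant_have_regex": ".*\\.php$"}'
config = json.loads(config_text)

# After JSON decoding, each pattern holds a single backslash before the full stop.
print(config["url_must_have_regex"])


def is_indexable(url, config):
    """Apply the include and exclude patterns to a URL (illustrative only)."""
    must = config.get("url_must_have_regex")
    cant = config.get("url_cant_have_regex")
    if must and not re.search(must, url):
        return False
    if cant and re.search(cant, url):
        return False
    return True


for url in (
    "http://www.hodsite.com/index.html",
    "http://www.hodsite.com/form.php",
    "http://www.hodsite.com/indexhtml",
):
    print(url, is_indexable(url, config))
```

The third URL is rejected because the escaped full stop must match a literal "." before html.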

Note: You must double-escape literals for JSON to process them correctly. In these examples, the full-stop in front of the string html or php is a literal. Regular expression syntax requires escaping the full-stop by putting a backslash in front of it. JSON syntax requires escaping that backslash with another backslash:
\\.

"url_must_have_regex": ".*\\.html$",
"url_cant_have_regex": ".*\\.php$",

  • The first regex matches only pages with the extension html:
    • . matches any character except newline...
    • * ...zero or more times...
    • \\. ...up to the literal full-stop...
    • html ...followed by the literal string html...
    • $ ...at the end of the line.
  • Along the same principles, the second regex excludes all pages with the extension php.

Clip Pages

The following three parameters let you specify recurring parts of web pages to keep for, or exclude from, indexing:

  • clip_page
  • clip_page_using_css_select
  • clip_page_using_css_unselect

This is useful, for example, to remove headers, footers, banners, and boilerplate text from the indexed pages.

clip_page, set to true by default, activates page clipping. Set it to false if you want all parts of the pages on the website to be indexed.

Without further definition with either of the next two parameters, the pages are clipped using a pre-defined algorithm. Use this if the site to index has a variable structure, or if you are not sure of the exact elements you want to exclude. The algorithm generally recognizes a wide range of common CSS2 definitions of elements such as headers, footers, sidebars, navigation bars, cookie notices and banners.

The clip_page_using_css_select and clip_page_using_css_unselect parameters give you finer control over page clipping when the website has a regular, predictable structure. Use them to define comma-separated lists of CSS2 selectors that specify the exact parts of a page to keep or exclude. You can find the CSS2 selectors by inspecting the source code of the web pages. These parameters also keep or exclude all descendants of the selected elements. Make sure clip_page is set to true if you set these parameters. Setting either or both of these parameters disables the standard algorithm.

For example, the following configuration indexes only the content inside elements that match the selector div.body, which is likely to hold the main content of the page. If the selector does not match any elements, the connector indexes the entire page.

"clip_page": true,
"clip_page_using_css_select": "div.body"

This second example indexes everything on the page except the elements that match div.banner or div.footer. If neither selector matches any elements, the connector indexes the entire page.

"clip_page": true,
"clip_page_using_css_unselect": "div.banner,div.footer"

An empty string ("") in either the clip_page_using_css_select or clip_page_using_css_unselect parameter is equivalent to the parameter not being set. If you want to unset the parameter, update the configuration with an empty string.
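If you keep your selectors in lists, a small helper can produce the comma-separated values these parameters expect. The clip_settings function below is a hypothetical sketch that reproduces the two configurations shown above.

```python
import json


def clip_settings(keep=(), drop=()):
    """Build the page-clipping part of a config from lists of CSS selectors."""
    # clip_page must be true for the selector parameters to take effect.
    settings = {"clip_page": True}
    if keep:
        settings["clip_page_using_css_select"] = ",".join(keep)
    if drop:
        settings["clip_page_using_css_unselect"] = ",".join(drop)
    return settings


keep_body = clip_settings(keep=["div.body"])
drop_chrome = clip_settings(drop=["div.banner", "div.footer"])
print(json.dumps(keep_body))
print(json.dumps(drop_chrome))
```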

Set Limits

In the previous Connector configuration examples, the Connector starts at www.hodsite.com and indexes the root page and all its sub-pages. To restrict indexing, you can limit the maximum number of pages (max_pages), the processing duration (duration), or the link depth (depth): 0 indexes the root page only, 1 also follows links one step away from the root page, and so on.
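To make the meaning of depth and a page limit concrete, here is a toy model in Python. It walks a small, made-up link graph breadth-first and stops following links at the given depth or once the page limit is reached. This illustrates the definitions above only; the Connector's actual crawl order and cut-off behavior are not described in this guide.

```python
from collections import deque

# Toy link graph standing in for a website (hypothetical pages).
links = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/a/2": [],
    "/b/1": [],
    "/a/1/x": [],
}


def crawl(root, depth, max_pages):
    """Breadth-first walk that stops at a link depth or a page count."""
    seen = [root]
    queue = deque([(root, 0)])
    while queue:
        page, level = queue.popleft()
        if level == depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen and len(seen) < max_pages:
                seen.append(target)
                queue.append((target, level + 1))
    return seen


print(crawl("/", depth=0, max_pages=40))  # root page only
print(crawl("/", depth=1, max_pages=40))  # root plus pages one link away
print(crawl("/", depth=3, max_pages=4))   # page limit reached before depth limit
```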

"max_pages": 40,
"duration": 300,
"depth": 3

Other Controls for Pages to Index

The max_links_per_page parameter lets you choose not to index pages with too many links, for example to avoid indexing a table of contents or sitemap.

"max_links_per_page": 1000,

The min_page_size and max_page_size parameters allow you to set restrictions (in bytes) on the sizes of the pages to index.
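The sketch below models how link-count and size limits could filter pages. The passes_limits helper is hypothetical, and treating each limit as an inclusive bound is an assumption, because this guide does not say how boundary values are handled.

```python
page_limits = {
    "max_links_per_page": 1000,
    "min_page_size": 4096,
    "max_page_size": 30000000,
}


def passes_limits(size_bytes, link_count, limits):
    """Return True if a page is within the link and size limits."""
    # Assumption: every limit is an inclusive bound.
    if link_count > limits["max_links_per_page"]:
        return False
    return limits["min_page_size"] <= size_bytes <= limits["max_page_size"]


print(passes_limits(20000, 35, page_limits))    # ordinary content page
print(passes_limits(1200, 35, page_limits))     # smaller than min_page_size
print(passes_limits(20000, 4500, page_limits))  # too many links, e.g. a sitemap
```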

"min_page_size": 4096,
"max_page_size": 30000000,