Web Cloud Connector
Configuration settings for adjusting the Haven OnDemand Cloud Web Connector.

The Web Cloud connector retrieves content from a public Web address and indexes it into Haven OnDemand. The connector follows links from a start page that you provide, and crawls to find and retrieve other Web pages on the same host.

This connector is a cloud connector that runs entirely in the Haven OnDemand environment.

Web Cloud Connector Configuration

The following table outlines the configuration options that you can set for the web_cloud connector flavor. You can use these in the JSON object that you pass to the config parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
url String The URL of the Web address that you want to extract content from. You can specify only one URL.

Optional Parameters

Parameter Type Description Default
url_cant_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the entire URL of the page does not match the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to exclude pages with a .php extension, use the regular expression .*\.php$. Default: .*\.css$|.*.css\?.*$|.*\.js$|.*\.js,.*$|.*.js\?.*
url_must_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the entire URL matches the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to fetch only pages with a .html extension specify .*\.html$.
spider_url_cant_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is crawled for links and ingested only if the entire URL of the page does not match the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to not retrieve pages in a given directory on a site specify ^http://www\.example\.com/ignore/.*$.
spider_url_must_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is crawled for links and ingested only if the entire URL matches the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to fetch only pages in a given directory on a site specify ^http://www\.example\.com/directory/.*$.
content_type_cant_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the content-type header for the page does not match the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to exclude icons, use the regular expression ^image/x-icon$. Default: ^((application|text)/(javascript|xml|x-javascript|css)(;.*)?)|(image/x-icon)$
content_type_must_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the content-type header for the page matches the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to fetch only html pages use the regular expression ^text/html(;.*)?$
max_pages Integer The maximum number of pages that the connector retrieves before stopping. The minimum is 10 pages. The maximum is 1000000 pages. 10000
duration Integer The maximum number of seconds the connector spends crawling. The minimum is 60 seconds. 3600
max_links_per_page Integer The maximum number of links permitted on a page. A page that exceeds this number of links is crawled (any links are followed), but the page is not ingested. To specify no limit, set max_links_per_page to 0. The minimum is 0. 1000
max_page_size Integer The maximum size of a page to ingest, in bytes. A page that exceeds this size is crawled (any links are followed), but the page is not ingested. This parameter applies to all file types, including images. The minimum value is 0, which specifies no maximum size. 163840
min_page_size Integer The minimum size of a page to ingest, in bytes. If a page is smaller than this, the page is crawled (any links are followed), but the page is not ingested. This parameter applies to all file types, including images. The minimum value is 0, which specifies no minimum size. 4096
page_timeout Integer The maximum number of seconds the connector waits to download a page before timing out. The minimum is 1 second, and the maximum is 604800 seconds. 120
depth Integer The maximum depth to which the connector follows links when crawling. For example, to index all pages that can be reached from the start URL by following no more than three links, set depth to 3. To index only the page specified by url, set depth to 0. To specify no limit, set depth to -1. 3
follow_redirect Boolean Set this parameter to false to prevent the connector from following HTTP redirections. True
follow_robot_protocol Boolean Set this parameter to false to prevent the connector from following the robots exclusion protocol (robots.txt). True
link_attributes String A comma-separated list of HTML attributes that the connector treats as links. To be recognized as a link, an attribute value must match the following regular expression:
^http:.*$|^https:.*$|^data:.*$
In other words, it must begin with http:, https:, or data:.
Default: href
login_selector String The CSS2 selector that locates the login input on the login page. Example: input#login. This will select the <input> element with id 'login'.
If you set this parameter, you must also set password_selector, submit_selector, and form_url_regex, and you must provide login_value and password_value in the connector credentials.
password_selector String The CSS2 selector that locates the password input on the login page. Example: input#pwd. This will select the <input> element with id 'pwd'.
If you set this parameter, you must also set login_selector, submit_selector, and form_url_regex, and you must provide login_value and password_value in the connector credentials.
submit_selector String The CSS2 selector that locates the login form submit button on the login page. Example: button.login. This will select the <button> element with class 'login'.
If you set this parameter, you must also set login_selector, password_selector, and form_url_regex, and you must provide login_value and password_value in the connector credentials.
form_url_regex String A regular expression that matches the login page to which the connector sends the login parameters, for example .*login.*. This example matches any URL that contains the string login. The connector attempts to log in on these pages.
If you set this parameter, you must also set login_selector, password_selector, and submit_selector, and you must provide login_value and password_value in the connector credentials.
clip_page Boolean Set this parameter to false if you do not want to clip HTML Web pages. Clipping removes uninteresting parts of a page. To specify the parts of pages to keep and remove, set the clip_page_using_css_select and clip_page_using_css_unselect parameters. If you do not set these parameters, the Web Connector uses an algorithm to decide which parts of the page to keep. True
clip_page_using_css_select String A comma-separated list of CSS2 selectors that specify the parts of a page to keep when the page is clipped. The Web Connector also keeps all descendants of these elements. To clip pages you must set the clip_page parameter to true. If you want to remove this setting from your configuration, set it to an empty string.
clip_page_using_css_unselect String A comma-separated list of CSS2 selectors that specify the parts of a page to remove when the page is clipped. The Web Connector also removes all descendants of these elements. The clip_page_using_css_select field is applied before clip_page_using_css_unselect, so you can use this parameter to remove unwanted descendants of elements that you specify in clip_page_using_css_select. To clip pages you must set the clip_page parameter to true. If you want to remove this setting from your configuration, set it to an empty string.
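
The URL filters can be tried out locally before you create a connector. The following is a minimal sketch, not the connector's implementation: it uses Python's re module (which is not PCRE, although the simple patterns shown here behave the same way), it assumes that "the entire URL matches" means a full match, and the helper name page_is_ingested is hypothetical.

```python
import re

def page_is_ingested(url, must_have=None, cant_have=None):
    """Return True if the URL passes both filters (hypothetical helper)."""
    # A must-have pattern has to match the entire URL.
    if must_have is not None and re.fullmatch(must_have, url) is None:
        return False
    # A cant-have pattern must not match the entire URL.
    if cant_have is not None and re.fullmatch(cant_have, url) is not None:
        return False
    return True

# Patterns taken from the url_must_have_regex and url_cant_have_regex examples.
must = r".*\.html$"
cant = r".*\.php$"
for candidate in ("http://www.example.com/a/index.html",
                  "http://www.example.com/a/index.php"):
    print(candidate, page_is_ingested(candidate, must, cant))
```

The same approach applies to the spider_url and content_type filters, with the content-type header value in place of the URL.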

Example Configuration

{
	"url": "http://www.wiki.com",
	"url_cant_have_regex": ".*\\.php$",
	"url_must_have_regex": ".*\\.html$",
	"max_pages": 40,
	"duration": 3600,
	"max_links_per_page": 1000,
	"max_page_size": 10000000,
	"min_page_size": 4096,
	"page_timeout": 120,
	"depth": 3,
	"follow_redirect": true,
	"follow_robot_protocol": true,
	"link_attributes": "href",
	"form_url_regex":".*login.*",
	"login_selector":"input[name=login]",
	"password_selector":"input[name=password]",
	"submit_selector":"button.btn_login",
	"clip_page": true,
	"clip_page_using_css_select": "div.body",
	"clip_page_using_css_unselect": "div.banner,div.footer"
}
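
Because the service rejects or ignores values outside the documented ranges, it can help to check a configuration before sending it. The sketch below is an illustration only: the helper check_config and its messages are hypothetical, the bounds are copied from the parameter descriptions and the Limits section of this page, and the service may apply checks that are not modeled here.

```python
# (minimum, maximum) per parameter; None means no bound is documented.
BOUNDS = {
    "max_pages": (10, 1000000),
    "duration": (60, None),
    "max_links_per_page": (0, 10000000),
    "max_page_size": (0, None),
    "min_page_size": (0, None),
    "page_timeout": (1, 604800),
}
# These four settings must be provided together.
LOGIN_KEYS = {"login_selector", "password_selector",
              "submit_selector", "form_url_regex"}

def check_config(config):
    """Return a list of problems found in a web_cloud config dict."""
    problems = []
    if not isinstance(config.get("url"), str) or not config["url"]:
        problems.append("url is required")
    for name, (low, high) in BOUNDS.items():
        if name not in config:
            continue
        value = config[name]
        if low is not None and value < low:
            problems.append(f"{name} is below the minimum of {low}")
        if high is not None and value > high:
            problems.append(f"{name} is above the maximum of {high}")
    present = LOGIN_KEYS & set(config)
    if present and present != LOGIN_KEYS:
        missing = sorted(LOGIN_KEYS - present)
        problems.append("login settings are incomplete, missing: "
                        + ", ".join(missing))
    return problems

print(check_config({"url": "http://www.example.com", "max_pages": 5}))
```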

Other Web Cloud Connector Behavior Restrictions

  • The Connector follows links only if they are on the same host.
  • Connectors spend a maximum of six minutes on a page during a single connector run.
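
To estimate which links a crawl can follow, you can compare host names yourself. This is a rough sketch under an assumption: it treats "same host" as an exact host name match and ignores scheme and port, which may differ from how the connector decides. The helper name on_same_host is hypothetical.

```python
from urllib.parse import urljoin, urlsplit

def on_same_host(start_url, link):
    """Resolve link against start_url and compare host names."""
    target = urljoin(start_url, link)  # handles relative links such as /about.html
    return urlsplit(target).hostname == urlsplit(start_url).hostname

start = "http://www.example.com/index.html"
for link in ("/about.html", "http://other.example.com/page.html"):
    print(link, on_same_host(start, link))
```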

Web Cloud Connector Destination

This section outlines the options that you can set for the destination that the connector indexes into. You can use these in the JSON object that you pass to the destination parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
action Enum The action to take when indexing documents. You can use the following options:
  • addtotextindex. Add documents directly to a Haven OnDemand text index.

Parameters for Add to Text Index Action

The following parameters are required in the destination JSON object when action is set to addtotextindex:

Parameter Type Description
index String The name of the text index that you want to index documents into. This index must already exist in Haven OnDemand (created by the Create Text Index API).

Example Destination

{
	"action": "addtotextindex",
	"index": "testindex"
}
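
The configuration and destination objects are passed to the Create Connector API as JSON. A minimal sketch of encoding them follows. It assumes that each object is sent as a JSON-encoded string under the parameter name given on this page; the helper build_parameters is hypothetical, and the endpoint, API key, and any other request parameters are deliberately left out because this page does not describe them.

```python
import json

def build_parameters(config, destination, schedule=None):
    """Encode each JSON object as a string parameter (hypothetical helper)."""
    params = {
        "config": json.dumps(config),
        "destination": json.dumps(destination),
    }
    if schedule is not None:
        params["schedule"] = json.dumps(schedule)
    return params

params = build_parameters(
    {"url": "http://www.example.com", "max_pages": 40},
    {"action": "addtotextindex", "index": "testindex"},
)
print(params["destination"])
```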

Web Cloud Connector Schedule

This section outlines the options that you can set for the schedule that the connector runs on. You can use these in the JSON object that you pass to the schedule parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
frequency Object The frequency configuration that describes how often to run the connector.

The frequency object must contain the following parameter:

Parameter Type Description
frequency_type Enum The type of frequency configuration to use. This setting affects the other parameters that you must set in the frequency object. You can use one of the following values:
  • seconds. The connector frequency is set in seconds. You must also specify the interval parameter.

When you have set the frequency_type parameter to seconds, you must also set the following parameters:

Parameter Type Description
interval Integer The number of seconds between each connector run. This interval measures from the start of one connector run to the start of the next. The maximum interval is 31536000.
Note: The exact interval that the connector uses might vary by up to 30 minutes, depending on load on the system and the scheduler.

Optional Parameters

Parameter Type Description
occurrences Integer The number of times to attempt to schedule a connector run. If you do not set occurrences, the number of runs is unlimited. The schedule stops either after this number of runs, or when it reaches the configured end_date, whichever occurs first.
start_date String The date to start scheduling the connector. For a list of available date formats, see Date Formats for Parameters. If you do not set a start_date, the connector runs after the first interval elapses.
end_date String The date to stop scheduling the connector. For a list of available date formats, see Date Formats for Parameters. The schedule stops either after this date, or when it has run the number of times configured in occurrences, whichever occurs first.

Example Schedule

{
	"occurrences": 5,
	"start_date": "1",
	"end_date": "29/06/2015 12:00:00 -0600",
	"frequency": {
		"frequency_type": "seconds",
		"interval": 21600
	}
}
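
To see how interval, occurrences, and end_date interact, you can list the nominal run times. The sketch below is an approximation under stated assumptions: it takes the first run to happen at the start time, uses Python datetime values rather than the service's date formats, and ignores the variation of up to 30 minutes noted above. The helper planned_runs and its limit argument are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def planned_runs(start, interval, occurrences=None, end=None, limit=100):
    """List nominal run start times; real timing may drift."""
    runs = []
    current = start
    while len(runs) < limit:
        # Stop after the configured number of runs...
        if occurrences is not None and len(runs) >= occurrences:
            break
        # ...or once the next run would fall after the end date.
        if end is not None and current > end:
            break
        runs.append(current)
        current += timedelta(seconds=interval)
    return runs

start = datetime(2015, 6, 28, 12, 0, tzinfo=timezone.utc)
end = datetime(2015, 6, 29, 18, 0, tzinfo=timezone.utc)
for run in planned_runs(start, interval=21600, occurrences=5, end=end):
    print(run.isoformat())
```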

Schedule Errors

If an attempt to run a connector fails, for example because an error occurred on the system, the Connector Status and Connector History APIs return an error status in the response for the schedule. In this case, Haven OnDemand retries the connector schedule up to three times. Each retry happens the next time Haven OnDemand scans the connector schedules, which it does every minute.

If the schedule fails three times, Haven OnDemand stops the connector schedule. In this case, you must either use the Update Connector API to set a new schedule, or manually start the connector with the Start Connector API.

Web Cloud Connector Credentials

This section outlines the options that you can use to set credentials for the connector. You can use these in the JSON object that you pass to the credentials parameter in the Create Connector API. The credentials parameter is optional for web_cloud flavor connectors.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
login_value String The value to use to populate the login_selector field that you set in the connector configuration.
If you set this parameter, you must also set password_value. You must also set login_selector, password_selector, submit_selector, and form_url_regex in the connector configuration.
password_value String The value to use to populate the password_selector field that you set in the connector configuration.
If you set this parameter, you must also set login_value. You must also set login_selector, password_selector, submit_selector, and form_url_regex in the connector configuration.
auth_user String The user name to use for pages that request authentication. You can use this parameter for Basic, HTTP Digest and NTLM version 2 authentication.
If you set this parameter, you must also set auth_password.
auth_password String The password to use for pages that request authentication. You can use this parameter for Basic, HTTP Digest and NTLM version 2 authentication.
If you set this parameter, you must also set auth_user.

Example Connector Credentials

{
	"login_value": "login",
	"password_value": "password"
}

{
	"auth_user": "login",
	"auth_password": "password"
}
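
The pairing rules between credentials and configuration are easy to get wrong, so a local check can be useful. This is a minimal sketch of those rules as described in the tables above; the helper check_credentials and its messages are hypothetical, and the service may enforce the rules differently.

```python
FORM_CONFIG_KEYS = ("login_selector", "password_selector",
                    "submit_selector", "form_url_regex")

def check_credentials(credentials, config):
    """Return a list of problems with a credentials/config pair."""
    problems = []
    has_login = "login_value" in credentials
    has_password = "password_value" in credentials
    if has_login != has_password:
        problems.append("login_value and password_value must be set together")
    # Form login values also need the four form settings in the config.
    if has_login or has_password:
        missing = [key for key in FORM_CONFIG_KEYS if key not in config]
        if missing:
            problems.append("config is missing: " + ", ".join(missing))
    if ("auth_user" in credentials) != ("auth_password" in credentials):
        problems.append("auth_user and auth_password must be set together")
    return problems

print(check_credentials({"auth_user": "login"}, {"url": "http://www.example.com"}))
```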

Web Cloud Connector Credentials Policy

This section outlines the options that you can use to set the credentials policy for the connector. The credentials policy options define when the system can decrypt credentials. You can use these parameters in the JSON object that you pass to the credentials_policy parameter in the Create Connector API. The credentials_policy parameter is required if you have set credentials.

The credentials policy controls how Haven OnDemand manages decryption tokens for storing and decrypting the credentials that the connector uses to access the repository. You can obtain a decryption token from the Start Connector and Retrieve Config APIs, which require the decryption of the connector credentials. Haven OnDemand sends the decryption token to an email address that you specify in the credentials policy.

The credentials policy also specifies how long the decryption token is valid for. If you send an invalid token to one of the APIs that requires it, the API automatically generates a new token and sends it to the notification email address.

The credentials policy has its own expiration date. After this time, you must renew the policy with the Update Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
notification_email String The email address to which to send information about connector activity.

Optional Parameters

Parameter Type Description Default
token_expiration Integer The number of seconds that a generated token remains valid. The expiration time is counted from the moment the token was generated. Every generated token is valid for the specified duration, and can be used for decryption a number of times specified by token_occurrences. After the token_expiration time, the token cannot be used, even if it has not had token_occurrences uses. When the token expires, a new token is generated, resetting the token_expiration time and token_occurrences. The minimum value is 1. 1800
token_occurrences Integer The number of times that a generated token can be used for decryption. Every generated token is valid for this number of uses, and can be used for a duration specified by token_expiration. After it has been used token_occurrences times, the token cannot be used, even if the token_expiration time has not been reached. When the number of uses is exhausted, a new token is generated, resetting the token_expiration time and token_occurrences. The minimum value is 1. 1
key_expiration String The duration that the credentials policy is valid for. When the key expires, the Haven OnDemand key management service returns an error stating that the policy has expired. For a list of available date formats, see Date Formats for Parameters. 3 months
notification_email_frequency Enum The frequency to use to send information about connector activity to the notification_email address. You can use the following values:
  • always. Always send email notifications for all connector activity.
  • on_decrypt. Send email notifications only when an attempt to decrypt connector credentials occurs.
  • on_failure. Send email notifications only when a failure occurs when using connector credentials.
  • never. Never send email notifications.
Default: on_decrypt

Example Credentials Policy

{
	"notification_email": "test@example.com",
	"notification_email_frequency": "always",
	"key_expiration": "19/06/2015 11:25:00",
	"token_expiration": 3600,
	"token_occurrences": 10
}
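
The interaction between token_expiration and token_occurrences can be pictured with a small model. The class below is a toy illustration of the documented rules, not the service's implementation: the class name DecryptionToken is hypothetical, times are plain numbers of seconds, and a token is treated as still valid at the exact moment of expiration, which is an assumption.

```python
class DecryptionToken:
    """Toy model: a token is valid until it expires or runs out of uses."""

    def __init__(self, issued_at, token_expiration=1800, token_occurrences=1):
        self.expires_at = issued_at + token_expiration
        self.uses_left = token_occurrences

    def use(self, now):
        """Consume one use; return True if the token was still valid."""
        if now > self.expires_at or self.uses_left <= 0:
            return False
        self.uses_left -= 1
        return True

# With the defaults, a token works once within 30 minutes of being issued.
token = DecryptionToken(issued_at=0)
print(token.use(now=60))
print(token.use(now=120))
```

Whichever limit is reached first ends the token's life, after which a new token has to be generated.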

Web Cloud Connector Limits

The web_cloud flavor Connector has the following limits:

Config

Property Max Limit
max_pages 1000000
max_links_per_page 10000000
page_timeout 604800

Schedule

Property Max Limit
interval 31536000

Static_resource_unit_cost

It costs 1 static resource unit to create a web_cloud flavor connector.

Start_connector_unit_cost

It costs 5 start connector units to start a web_cloud flavor connector.