Web Cloud Connector

The Web Cloud connector retrieves content from a public Web address and indexes it into Haven OnDemand. The connector follows links from a start page that you provide, and crawls to find and retrieve other Web pages on the same host.

This connector is a cloud connector that runs entirely in the Haven OnDemand environment.

Web_cloud Connector Configuration

The following table outlines the configuration options that you can set for the Web_cloud connector flavor. You can use these in the JSON object that you pass to the config parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
url String The URL of the Web address that you want to extract content from. You can specify only one URL.

Optional Parameters

Parameter Type Description Default
url_cant_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the entire URL of the page does not match the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to exclude pages with a .php extension, use the regular expression .*\.php$.
url_must_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the entire URL matches the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to fetch only pages with a .html extension specify .*\.html$.
max_pages Integer The maximum number of pages that the connector retrieves before stopping. The minimum is 10 pages. The maximum is 1000000 pages. 10000
duration Integer The maximum number of seconds the connector spends crawling. The minimum is 60 seconds. 3600
max_links_per_page Integer The maximum number of links permitted on a page. A page that exceeds this number of links is crawled (any links are followed), but the page is not ingested. To specify no limit, set max_links_per_page=0. The minimum is 0. 1000
max_page_size Integer The maximum size of a page to ingest, in bytes. A page that exceeds this size is crawled (any links are followed), but the page is not ingested. This parameter applies to all file types, including images. The minimum value is 0, which specifies no maximum size. 163840
min_page_size Integer The minimum size of a page to ingest, in bytes. If a page is smaller than this, the page is crawled (any links are followed), but the page is not ingested. This parameter applies to all file types, including images. The minimum value is 0, which specifies no minimum size. 4096
page_timeout Integer The maximum number of seconds the connector waits to download a page before timing out. The minimum is 1 second, and the maximum is 604800. 120
depth Integer The maximum depth to which the connector follows links when crawling. For example, to index all pages that can be reached from the start URL by following no more than three links, set depth=3. To index only the page specified by url, set depth=0. To specify no limit, set depth=-1. 3
follow_redirect Boolean Set this parameter to false to prevent the connector from following HTTP redirections. True
follow_robot_protocol Boolean Set this parameter to false to prevent the connector from following the robots protocol. True
link_attributes String A comma-separated list of HTML attributes that the connector treats as links. href
login_field_value String The name of the login field on the Web site login form. You can find this value by inspecting the HTML source of the login page and looking for the name of the login input field.
password_field_value String The name of the password field on the Web site login form. You can find this value by inspecting the HTML source of the login page and looking for the name of the password input field.
form_url_regex String A regular expression that matches the login page to which the connector sends the login parameters, for example .*login.*. This example matches any URL that contains the string login. The connector attempts to log in on these pages.
submit_selector String The name of the form submit button for the Web site login page. You can find this value for the Web page by inspecting the HTML source and looking for the field name for the button that submits the login form and values.
clip_page Boolean Set this parameter to false if you do not want to clip HTML Web pages. Clipping removes uninteresting parts of a page. To specify the parts of pages to keep and remove, set the clip_page_using_css_select and clip_page_using_css_unselect parameters. If you do not set these parameters, the Web Connector uses an algorithm to decide which parts of the page to keep. True
clip_page_using_css_select String A comma-separated list of CSS2 selectors that specify the parts of a page to keep when the page is clipped. The Web Connector also keeps all descendants of these elements. To clip pages you must set the clip_page parameter to true.
clip_page_using_css_unselect String A comma-separated list of CSS2 selectors that specify the parts of a page to remove when the page is clipped. The Web Connector also removes all descendants of these elements. The clip_page_using_css_select field is applied before clip_page_using_css_unselect, so you can use this parameter to remove unwanted descendants of elements that you specify in clip_page_using_css_select. To clip pages you must set the clip_page parameter to true.
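
The two URL filters can be illustrated with a short Python sketch. It is only an approximation: Python's re module is not a Perl-compatible engine, although the simple patterns shown in this table behave the same way in both. The function name url_passes_filters is hypothetical and is not part of Haven OnDemand.

```python
import re

def url_passes_filters(url, must_have=None, cant_have=None):
    """Return True if a URL would be ingested under the two regex options."""
    # url_must_have_regex: the entire URL has to match the expression.
    if must_have is not None and re.fullmatch(must_have, url) is None:
        return False
    # url_cant_have_regex: the entire URL must NOT match the expression.
    if cant_have is not None and re.fullmatch(cant_have, url) is not None:
        return False
    return True

print(url_passes_filters("http://www.wiki.com/index.html", must_have=r".*\.html$"))  # True
print(url_passes_filters("http://www.wiki.com/index.php", cant_have=r".*\.php$"))    # False
```

Because the whole URL must match, a page such as index.php?id=1 is not excluded by .*\.php$ in this sketch; a pattern that also allows a query string would be needed for that.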

Example Configuration

{
	"url": "http://www.wiki.com",
	"url_cant_have_regex": ".*\\.php$",
	"url_must_have_regex": ".*\\.html$",
	"max_pages": 40,
	"duration": 3600,
	"max_links_per_page": 1000,
	"max_page_size": 10000000,
	"min_page_size": 4096,
	"page_timeout": 120,
	"depth": 3,
	"follow_redirect": true,
	"follow_robot_protocol": true,
	"link_attributes": "href",
	"login_field_value": "os_username",
	"password_field_value": "os_password",
	"form_url_regex":".*login.*",
	"submit_selector":"input[name=login]",
	"clip_page": true,
	"clip_page_using_css_select": "div.body",
	"clip_page_using_css_unselect": "div.banner,div.footer"
}
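
Before sending a configuration, it can help to check it against the documented ranges. The following Python sketch does this on the client side, using only the minimum and maximum values stated in this document. The function name validate_config and the error wording are hypothetical; the service performs its own validation and may report problems differently.

```python
# (minimum, maximum) bounds taken from the tables in this document.
# None means that the document states no bound.
BOUNDS = {
    "max_pages": (10, 1000000),
    "duration": (60, None),
    "max_links_per_page": (0, 10000000),
    "max_page_size": (0, None),
    "min_page_size": (0, None),
    "page_timeout": (1, 604800),
}

def validate_config(config):
    """Return a list of problems found in a Web_cloud config dictionary."""
    errors = []
    # url is the only required parameter.
    if not isinstance(config.get("url"), str) or not config["url"]:
        errors.append("url is required and must be a non-empty string")
    for name, (low, high) in BOUNDS.items():
        if name not in config:
            continue  # optional parameter, the default applies
        value = config[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{name} must be an integer")
        elif low is not None and value < low:
            errors.append(f"{name} must be at least {low}")
        elif high is not None and value > high:
            errors.append(f"{name} must be at most {high}")
    return errors

print(validate_config({"url": "http://www.wiki.com", "max_pages": 40, "duration": 3600}))  # []
```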

Other Web Cloud Connector Behavior Restrictions

  • The Connector follows links only if they are on the same host.
  • Connectors spend a maximum of six minutes on a page during a single connector run.
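
The interaction of the same-host rule with the depth, min_page_size, and max_page_size options can be sketched as a toy crawl over an in-memory set of pages. This is an illustration under stated assumptions, not the connector's actual algorithm: the function simulate_crawl and the pages dictionary are hypothetical, and the sketch ignores max_pages, max_links_per_page, the regex filters, and timeouts.

```python
from collections import deque
from urllib.parse import urlsplit

def simulate_crawl(start_url, pages, depth=3, min_page_size=4096, max_page_size=163840):
    """pages maps URL -> (size in bytes, list of linked URLs).

    Returns the URLs that would be ingested, in breadth-first order.
    """
    host = urlsplit(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    ingested = []
    while queue:
        url, level = queue.popleft()
        if url not in pages:
            continue
        size, links = pages[url]
        # A value of 0 disables the corresponding size limit.
        too_small = min_page_size > 0 and size < min_page_size
        too_large = max_page_size > 0 and size > max_page_size
        if not (too_small or too_large):
            ingested.append(url)
        # Links are followed even when the page itself is not ingested,
        # unless the depth limit is reached (-1 means no limit).
        if depth != -1 and level >= depth:
            continue
        for link in links:
            # Only links on the same host as the start URL are followed.
            if link not in seen and urlsplit(link).hostname == host:
                seen.add(link)
                queue.append((link, level + 1))
    return ingested
```

For example, a page of 100 bytes is skipped for ingestion with the default min_page_size of 4096, but pages that it links to on the same host can still be reached and ingested.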

Web_cloud Connector Destination

This section outlines the options that you can set for the destination that the connector indexes into. You can use these in the JSON object that you pass to the destination parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
action Enum The action to take when indexing documents. You can use the following options:
  • addtotextindex. Add documents directly to a Haven OnDemand text index.

Parameters for Add to Text Index Action

The following parameters are required in the destination JSON object when action is set to addtotextindex.

Parameter Type Description
index String The name of the text index that you want to index documents into. This index must already exist in Haven OnDemand (created by the Create Text Index API).

Example Destination

{
	"action": "addtotextindex",
	"index": "testindex"
}

Web_cloud Connector Schedule

This section outlines the options that you can set for the schedule that the connector runs on. You can use these in the JSON object that you pass to the schedule parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
frequency Object The frequency configuration that describes how often to run the connector.

The frequency object must contain the following parameter:

Parameter Type Description
frequency_type Enum The type of frequency configuration to use. This setting affects the other parameters that you must set in the frequency object. You can use one of the following values:
  • seconds. The connector frequency is set in seconds. You must also specify the interval parameter.

When you have set the frequency_type parameter to seconds, you must also set the following parameters:

Parameter Type Description
interval Integer The number of seconds between each connector run. This interval measures from the start of one connector run to the start of the next. The maximum interval is 31536000.
Note: The exact interval that the connector uses might vary by up to 30 minutes, depending on load on the system and the scheduler.

Optional Parameters

Parameter Type Description
occurrences Integer The number of times to attempt to schedule a connector run. If you do not set occurrences, the number of runs is unlimited. The schedule stops either after this number of runs, or when it reaches the configured end_date, whichever occurs first.
start_date String The date to start scheduling the connector. For a list of available date formats, see Date Formats for Parameters. If you do not set a start_date, the connector runs after the first interval elapses.
end_date String The date to stop scheduling the connector. For a list of available date formats, see Date Formats for Parameters. The schedule stops either after this date, or when it has run the number of times configured in occurrences, whichever occurs first.

Example Schedule

{
	"occurrences": 5,
	"start_date": "1",
	"end_date": "29/06/2015 12:00:00 -0600",
	"frequency": {
		"frequency_type": "seconds",
		"interval": 21600
	}
}
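
To see how interval, occurrences, and end_date combine, the following Python sketch lists the planned start times of connector runs. It rests on assumptions: the function planned_runs is hypothetical, the time of the first run is passed in explicitly rather than derived from start_date, a run that falls exactly on end_date is treated as allowed, and the scheduling variation of up to 30 minutes is ignored.

```python
from datetime import datetime, timedelta, timezone

def planned_runs(first_run, interval, occurrences=None, end_date=None, limit=100):
    """Return planned run start times, spaced interval seconds apart.

    The schedule stops after occurrences runs or after end_date,
    whichever occurs first. limit caps the output for unlimited schedules.
    """
    runs = []
    current = first_run
    while len(runs) < limit:
        if occurrences is not None and len(runs) >= occurrences:
            break
        if end_date is not None and current > end_date:
            break
        runs.append(current)
        # The interval is measured from the start of one run to the next.
        current += timedelta(seconds=interval)
    return runs

first = datetime(2015, 6, 28, 0, 0, tzinfo=timezone.utc)
for run in planned_runs(first, interval=21600, occurrences=5):
    print(run.isoformat())
```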

Schedule Errors

If an attempt to run a connector fails, for example because an error occurred on the system, the Connector Status and Connector History APIs return an error status in the response for the schedule. In this case, Haven OnDemand retries the connector schedule up to three times. Each retry takes place the next time Haven OnDemand scans the connector schedules (every minute).

If the schedule fails three times, Haven OnDemand stops the connector schedule. In this case, you must either use the Update Connector API to set a new schedule, or manually start the connector with the Start Connector API.
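
The stop-after-three-failures rule can be modeled as a small counter. The sketch below is a simplified reading of the behavior described above, with assumptions labeled: the function schedule_state is hypothetical, and resetting the failure count after a successful run is an assumption that this document does not confirm.

```python
MAX_FAILURES = 3  # the schedule stops after three failures

def schedule_state(run_results):
    """run_results is an iterable of booleans, True for a successful attempt.

    Returns "stopped" once three consecutive attempts have failed,
    otherwise "active".
    """
    failures = 0
    for succeeded in run_results:
        if succeeded:
            failures = 0  # assumption: a success clears earlier failures
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                return "stopped"
    return "active"

print(schedule_state([True, False, False, True]))  # active
print(schedule_state([False, False, False]))       # stopped
```

A stopped schedule in this model corresponds to the state in which you must set a new schedule or start the connector manually.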

Web_cloud Connector Credentials

This section outlines the options that you can use to set credentials for the connector. You can use these in the JSON object that you pass to the credentials parameter in the Create Connector API. The credentials parameter is optional for Web_cloud flavor connectors.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
login_value String The value to use to populate the login_field_value field that you set in the connector configuration.
password_value String The value to use to populate the password_field_value field that you set in the connector configuration.

Example Connector Credentials

{
	"login_value": "login",
	"password_value": "password"
}

Web_cloud Connector Credentials Policy

This section outlines the options that you can use to set the credentials policy for the connector. The credentials policy options define when the system can decrypt credentials. You can use these parameters in the JSON object that you pass to the credentials_policy parameter in the Create Connector API. The credentials_policy parameter is required if you have set credentials.

The credentials policy controls how Haven OnDemand manages decryption tokens for storing and decrypting the credentials that the connector uses to access the repository. You can obtain a decryption token from the Start Connector and Retrieve Config APIs, which require the decryption of the connector credentials. Haven OnDemand sends the decryption token to an email address that you specify in the credentials policy.

The credentials policy also specifies how long the decryption token is valid for. If you send an invalid token to one of the APIs that requires it, the API automatically generates a new token and sends it to the notification email address.

The credentials policy has its own expiration date. After this time, you must renew the policy with the Update Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
notification_email String The email address to which to send information about connector activity.

Optional Parameters

Parameter Type Description Default
token_expiration Integer The number of seconds that a generated token remains valid. The expiration time is counted from the moment the token was generated. Every generated token is valid for the specified duration, and can be used for decryption a number of times specified by token_occurrences. After the token_expiration time, the token cannot be used, even if it has not had token_occurrences uses. When the token expires, a new token is generated, resetting the token_expiration time and token_occurrences. The minimum value is 1. 1800
token_occurrences Integer The number of times that a generated token can be used for decryption. Every generated token is valid for this number of uses, and can be used for a duration specified by token_expiration. After it has been used token_occurrences times, the token cannot be used, even if the token_expiration time has not been reached. When the number of uses is exhausted, a new token is generated, resetting the token_expiration time and token_occurrences. The minimum value is 1. 1
key_expiration String The duration that the credentials policy is valid for. When the key expires, the Haven OnDemand key management service returns an error stating that the policy has expired. For a list of available date formats, see Date Formats for Parameters. 3 months
notification_email_frequency Enum The frequency to use to send information about connector activity to the notification_email address. You can use the following values:
  • always. Always send email notifications for all connector activity.
  • on_decrypt. Send email notifications only when an attempt to decrypt connector credentials occurs.
  • on_failure. Send email notifications only when a failure occurs when using connector credentials.
  • never. Never send email notifications.
Default: on_decrypt
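
The combined effect of token_expiration and token_occurrences can be modeled with a small class: a token stops working as soon as either limit is reached. This is a sketch of the rules described in the table, not of the Haven OnDemand key management service. The class DecryptionToken is hypothetical, times are plain numbers of seconds, and treating a use at exactly the expiration time as valid is an assumption.

```python
class DecryptionToken:
    """Toy model of a decryption token with a lifetime and a use count."""

    def __init__(self, issued_at, token_expiration=1800, token_occurrences=1):
        # The expiration time is counted from the moment of generation.
        self.expires_at = issued_at + token_expiration
        self.uses_left = token_occurrences

    def is_valid(self, now):
        # Both conditions must hold: not expired and uses remaining.
        return now <= self.expires_at and self.uses_left > 0

    def use(self, now):
        """Consume one use; return False if the token can no longer be used."""
        if not self.is_valid(now):
            return False
        self.uses_left -= 1
        return True

token = DecryptionToken(issued_at=0, token_expiration=3600, token_occurrences=2)
print(token.use(10), token.use(20), token.use(30))  # True True False
```

In the real service, an invalid token leads to a new token being generated and sent by email; the sketch stops at reporting that the token is no longer usable.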

Example Credentials Policy

{
	"notification_email": "test@example.com",
	"notification_email_frequency": "always",
	"key_expiration": "19/06/2015 11:25:00",
	"token_expiration": 3600,
	"token_occurrences": 10
}
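
The JSON objects described so far are each passed to a separate parameter of the Create Connector API. The following Python sketch assembles them into one dictionary of request parameters without sending anything. The parameter names config, destination, schedule, credentials, and credentials_policy come from this document; the names connector and flavor, the helper build_create_connector_params, and the treatment of schedule as optional are assumptions, so check the Create Connector API reference before relying on them.

```python
import json

def build_create_connector_params(name, config, destination, schedule=None,
                                  credentials=None, credentials_policy=None):
    """Serialize each configuration object to a JSON string parameter."""
    # credentials_policy is required if credentials are set.
    if credentials is not None and credentials_policy is None:
        raise ValueError("credentials_policy is required when credentials are set")
    params = {
        "connector": name,      # assumed parameter name
        "flavor": "web_cloud",  # assumed parameter name
        "config": json.dumps(config),
        "destination": json.dumps(destination),
    }
    optional = {
        "schedule": schedule,
        "credentials": credentials,
        "credentials_policy": credentials_policy,
    }
    for key, value in optional.items():
        if value is not None:
            params[key] = json.dumps(value)
    return params

params = build_create_connector_params(
    "mywebconnector",  # hypothetical connector name
    {"url": "http://www.wiki.com", "depth": 3},
    {"action": "addtotextindex", "index": "testindex"},
)
print(sorted(params))  # ['config', 'connector', 'destination', 'flavor']
```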

Web_cloud Connector Limits

The Web_cloud flavor connector has the following limits:

Config

Property Max Limit
max_pages 1000000
max_links_per_page 10000000
page_timeout 604800

Schedule

Property Max Limit
interval 31536000

Static_resource_unit_cost

It costs 1 static resource unit to create a web_cloud flavor connector.

Start_connector_unit_cost

It costs 5 start connector units to start a web_cloud flavor connector.