Web Onsite Connector
Configuration settings for adjusting the Haven OnDemand Onsite Web Connector.

The Web Onsite Connector retrieves content from a Web address and indexes it into Haven OnDemand. The connector follows links from a start page that you provide, and crawls to find and retrieve other Web pages on the same host.

The Web Onsite Connector is an onsite connector, which you install on your own system.

Web Onsite Connector Configuration

The following table outlines the configuration options that you can set for the web_onsite connector flavor. You can use these in the JSON object that you pass to the config parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
url String The URL of the Web address that you want to extract content from. You can specify only one URL.

Optional Parameters

Parameter Type Description Default
max_task_duration String If specified, the maximum duration of a task in the format H[H][:MM][:SS]. If the maximum duration is exceeded, the task stops.
service_port Integer The configured service port for the connector. This value is set in the connector configuration file, and the connector uses this port to listen for service control requests, such as stopping the connector.
This port must be available on the machine where the connector is installed, and the connector will open it. If port 8012 is not available on your connector host machine, you must reconfigure it. If you have more than one instance of the connector on your host machine, you must change this value for all but one of the connectors.
Default: 8012
aci_port Integer The configured action port for the connector. This value is set in the connector configuration file, and the connector uses this port to listen for actions, such as requesting a connector run.
This port must be available on the machine where the connector is installed, and the connector will open it. If port 8010 is not available on your connector host machine, you must reconfigure it. If you have more than one instance of the connector on your host machine, you must change this value for all but one of the connectors.
Default: 8010
use_proxy Boolean Set to true if the URL that you want to extract content from requires a proxy to access. You can also configure the proxy when you install the connector.
url_cant_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the entire URL of the page does not match the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to exclude pages with a .php extension, use the regular expression .*\.php$. Default: .*\.css$|.*.css\?.*$|.*\.js$|.*\.js,.*$|.*.js\?.*
url_must_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the entire URL matches the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to fetch only pages with a .html extension specify .*\.html$.
spider_url_cant_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is crawled for links and ingested only if the entire URL of the page does not match the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to not retrieve pages in a given directory on a site specify ^http://www\.example\.com/ignore/.*$.
spider_url_must_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is crawled for links and ingested only if the entire URL matches the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to fetch only pages in a given directory on a site specify ^http://www\.example\.com/directory/.*$.
content_type_cant_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the content-type header for the page does not match the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to exclude icons, use the regular expression ^image/x-icon$. Default: ^((application|text)/(javascript|xml|x-javascript|css)(;.*)?)|(image/x-icon)$
content_type_must_have_regex String A Perl-compatible regular expression to restrict the content retrieved by the connector. A page is ingested only if the content-type header for the page matches the regular expression. This parameter applies to all file types, for example HTML pages, images, text documents, and so on. For example, to fetch only html pages use the regular expression ^text/html(;.*)?$
max_pages Integer The maximum number of pages that the connector retrieves before stopping. The minimum is 10 pages. The maximum is 1000000 pages. Default: 10000
duration Integer The maximum number of seconds the connector spends crawling. The minimum is 60 seconds. Default: 3600
max_links_per_page Integer The maximum number of links permitted on a page. A page that exceeds this number of links is crawled (any links are followed), but the page is not ingested. To specify no limit, set max_links_per_page to 0. The minimum is 0. Default: 1000
max_page_size Integer The maximum size of a page to ingest, in bytes. A page that exceeds this size is crawled (any links are followed), but the page is not ingested. This parameter applies to all file types, including images. The minimum value is 0, which specifies no maximum size. Default: 163840
min_page_size Integer The minimum size of a page to ingest, in bytes. If a page is smaller than this, the page is crawled (any links are followed), but the page is not ingested. This parameter applies to all file types, including images. The minimum value is 0, which specifies no minimum size. Default: 4096
page_timeout Integer The maximum number of seconds the connector waits to download a page before timing out. The minimum is 1 second, and the maximum is 604800 seconds. Default: 120
depth Integer The maximum depth to which the connector follows links when crawling. For example, to index all pages that can be reached from the url by following no more than three links, set depth to 3. To index only the page specified by url, set depth to 0. To specify no limit, set depth to -1. Default: 3
follow_redirect Boolean Set this parameter to false to prevent the connector from following HTTP redirects. Default: true
follow_robot_protocol Boolean Set this parameter to false to prevent the connector from following the robot protocol. Default: true
link_attributes String A comma-separated list of HTML attributes that the connector treats as links. Default: href
login_selector String The CSS2 selector that locates the login input on the login page. Example: input#login. This will select the <input> element with id 'login'.
If you set this parameter, you must also set password_selector, submit_selector, and form_url_regex, and you must provide login_value and password_value in the connector credentials.
password_selector String The CSS2 selector that locates the password input on the login page. Example: input#pwd. This will select the <input> element with id 'pwd'.
If you set this parameter, you must also set login_selector, submit_selector, and form_url_regex, and you must provide login_value and password_value in the connector credentials.
submit_selector String The CSS2 selector that locates the login form submit button on the login page. Example: button.login. This will select the <button> element with class 'login'.
If you set this parameter, you must also set login_selector, password_selector, and form_url_regex, and you must provide login_value and password_value in the connector credentials.
form_url_regex String A regular expression to use to match the login page to which the connector sends the login parameters, for example .*login.*. This example matches any URL that contains the string login. The connector attempts to log in on these pages.
If you set this parameter, you must also set login_selector, password_selector, and submit_selector, and you must provide login_value and password_value in the connector credentials.
clip_page Boolean Set this parameter to false if you do not want to clip HTML Web pages. Clipping removes uninteresting parts of a page. To specify the parts of pages to keep and remove, set the clip_page_using_css_select and clip_page_using_css_unselect parameters. If you do not set these parameters, the Web Connector uses an algorithm to decide which parts of the page to keep. Default: true
clip_page_using_css_select String A comma-separated list of CSS2 selectors that specify the parts of a page to keep when the page is clipped. The Web Connector also keeps all descendants of these elements. To clip pages, you must set the clip_page parameter to true.
clip_page_using_css_unselect String A comma-separated list of CSS2 selectors that specify the parts of a page to remove when the page is clipped. The Web Connector also removes all descendants of these elements. The clip_page_using_css_select field is applied before clip_page_using_css_unselect, so you can use this parameter to remove unwanted descendants of elements that you specify in clip_page_using_css_select. To clip pages, you must set the clip_page parameter to true.
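
The must_have and cant_have filters described in the table can be tried out locally before you create a connector. The following Python sketch is not the connector's implementation: the passes_filters helper is hypothetical, and it assumes that "the entire URL matches" means a whole-string match. Python's re module is not a Perl-compatible engine, but the simple patterns used in this section behave the same way in both.

```python
import re


def passes_filters(value, must_have=None, cant_have=None):
    """Return True when value satisfies both whole-string filters.

    Hypothetical helper: must_have has to match the entire value,
    and cant_have must not match the entire value.
    """
    if must_have is not None and re.fullmatch(must_have, value) is None:
        return False
    if cant_have is not None and re.fullmatch(cant_have, value) is not None:
        return False
    return True


# URL filters: keep .html pages and drop .php pages.
print(passes_filters("http://www.example.com/index.html",
                     must_have=r".*\.html$", cant_have=r".*\.php$"))
print(passes_filters("http://www.example.com/index.php",
                     cant_have=r".*\.php$"))

# Content-type filter: the optional (;.*)? group allows a charset suffix.
print(passes_filters("text/html; charset=utf-8",
                     must_have=r"^text/html(;.*)?$"))
```

The same helper works for URLs and for content-type headers, because both parameter families take a pattern and a string to test.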

Example Configuration

{
	"url": "http://www.wiki.com",
	"use_proxy": true,
	"url_cant_have_regex": ".*\\.php$",
	"url_must_have_regex": ".*\\.html$",
	"max_pages": 40,
	"duration": 3600,
	"max_links_per_page": 1000,
	"max_page_size": 10000000,
	"min_page_size": 4096,
	"page_timeout": 120,
	"depth": 3,
	"follow_redirect": true,
	"follow_robot_protocol": true,
	"link_attributes": "href",
	"form_url_regex":".*login.*",
	"login_selector":"input[name=login]",
	"password_selector":"input[name=password]",
	"submit_selector":"button.btn_login",
	"clip_page": true,
	"clip_page_using_css_select": "div.body",
	"clip_page_using_css_unselect": "div.banner,div.footer"
}
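
Several rules in the parameter table depend on more than one setting, so it can help to check a configuration before you send it. The following Python sketch is a local pre-flight check, not part of Haven OnDemand. The check_config function is hypothetical, and the duration pattern is one reading of the H[H][:MM][:SS] format (one or two hour digits, then optional two-digit minutes and seconds). It does not check login_value and password_value, because those belong to the connector credentials.

```python
import re

# The four login settings must be set together.
LOGIN_KEYS = {"login_selector", "password_selector",
              "submit_selector", "form_url_regex"}

# Assumed interpretation of H[H][:MM][:SS].
DURATION_RE = re.compile(r"\d{1,2}(:\d{2}){0,2}")


def check_config(config):
    """Return a list of problems found in a config dictionary."""
    problems = []
    if "url" not in config:
        problems.append("url is required")

    present = LOGIN_KEYS.intersection(config)
    if present and present != LOGIN_KEYS:
        missing = sorted(LOGIN_KEYS - present)
        problems.append("missing login settings: " + ", ".join(missing))

    duration = config.get("max_task_duration")
    if duration is not None and DURATION_RE.fullmatch(duration) is None:
        problems.append("max_task_duration must look like H[H][:MM][:SS]")
    return problems


print(check_config({
    "url": "http://www.wiki.com",
    "login_selector": "input[name=login]",
    "max_task_duration": "1:30:00",
}))
```

An empty list means that none of these particular checks failed. It does not guarantee that the API accepts the configuration.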

Other Web Onsite Connector Behavior Restrictions

  • The Connector follows links only if they are on the same host.
  • Connectors spend a maximum of six minutes on a page during a single connector run.

Web Onsite Connector Destination

This section outlines the options that you can set for the destination that the connector indexes into. You can use these in the JSON object that you pass to the destination parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
action Enum The action to take when indexing documents. You can use the following options:
  • addtotextindex. Add documents directly to a Haven OnDemand text index.

Parameters for Add to Text Index Action

The following parameters are required in the destination JSON object when action is set to addtotextindex.

Parameter Type Description
index String The name of the text index that you want to index documents into. This index must already exist in Haven OnDemand (created by the Create Text Index API).

Example Destination

{
	"action": "addtotextindex",
	"index": "testindex"
}
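
The config and destination objects are passed to the Create Connector API as JSON text. The following Python sketch shows one way to serialize them. The build_parameters function is hypothetical, and the sketch makes no HTTP request: this section does not give the endpoint, the authentication scheme, or the names of the remaining parameters, so they are left out rather than guessed.

```python
import json


def build_parameters(config, destination, schedule=None):
    """Serialize each settings object to JSON text, keyed by parameter name."""
    parameters = {
        "config": json.dumps(config),
        "destination": json.dumps(destination),
    }
    if schedule is not None:
        parameters["schedule"] = json.dumps(schedule)
    return parameters


parameters = build_parameters(
    {"url": "http://www.wiki.com", "depth": 3},
    {"action": "addtotextindex", "index": "testindex"},
)
for name, value in parameters.items():
    print(name, "=", value)
```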

Web Onsite Connector Schedule

This section outlines the options that you can set for the schedule that the connector runs on. You can use these in the JSON object that you pass to the schedule parameter in the Create Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
frequency Object The frequency configuration that describes how often to run the connector.

The frequency object must contain the following parameter:

Parameter Type Description
frequency_type Enum The type of frequency configuration to use. This setting affects the other parameters that you must set in the frequency object. You can use one of the following values:
  • seconds. The connector frequency is set in seconds. You must also specify the interval parameter.

When you have set the frequency_type parameter to seconds, you must also set the following parameters:

Parameter Type Description
interval Integer The number of seconds between each connector run. This interval measures from the start of one connector run to the start of the next. The maximum interval is 31536000.
Note: The exact interval that the connector uses might vary by up to 30 minutes, depending on load on the system and the scheduler.

Optional Parameters

Parameter Type Description
occurrences Integer The number of times to attempt to schedule a connector run. If you do not set occurrences, the number of runs is unlimited. The schedule stops either after this number of runs, or when it reaches the configured end_date, whichever occurs first.
start_date String The date to start scheduling the connector. For a list of available date formats, see Date Formats for Parameters. If you do not set a start_date, the connector runs after the first interval elapses.
end_date String The date to stop scheduling the connector. For a list of available date formats, see Date Formats for Parameters. The schedule stops either after this date, or when it has run the number of times configured in occurrences, whichever occurs first.

Example Configuration

{
	"occurrences": 5,
	"start_date": "1",
	"end_date": "29/06/2015 12:00:00 -0600",
	"frequency": {
		"frequency_type": "seconds",
		"interval": 21600
	}
}
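
The interval is given in seconds, so the 21600 in the example is six hours, and the maximum of 31536000 is 365 days. The following Python sketch builds a schedule object from an interval and checks it against that maximum. The seconds_schedule function is hypothetical, and the lower bound of one second is an assumption, because this section states only the maximum.

```python
MAX_INTERVAL_SECONDS = 31536000  # 365 days, the documented maximum


def seconds_schedule(interval, occurrences=None):
    """Build a schedule object that uses the seconds frequency type."""
    # Assumed lower bound of 1 second; only the maximum is documented.
    if not 1 <= interval <= MAX_INTERVAL_SECONDS:
        raise ValueError("interval must be between 1 and 31536000 seconds")
    schedule = {
        "frequency": {"frequency_type": "seconds", "interval": interval}
    }
    if occurrences is not None:
        schedule["occurrences"] = occurrences
    return schedule


# Run every six hours, five times.
print(seconds_schedule(6 * 60 * 60, occurrences=5))
```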

Schedule Errors

If an attempt to run a connector fails, for example because an error occurred on the system, the Connector Status and Connector History APIs return an error status in the response for the schedule. In this case, Haven OnDemand attempts to retry the connector schedule up to three times. When a schedule fails, Haven OnDemand attempts to retry it the next time it scans the connector schedules (every minute).

If the schedule fails three times, Haven OnDemand stops the connector schedule. In this case, you must either use the Update Connector API to set a new schedule, or manually start the connector with the Start Connector API.

Web Onsite Connector Credentials

This section outlines the options that you can use to set credentials for the connector. You can use these in the JSON object that you pass to the credentials parameter in the Create Connector API. The credentials parameter is optional for web_onsite flavor connectors.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
login_value String The value to use to populate the login_selector field that you set in the connector configuration.
If you set this parameter, you must also set password_value. You must also set login_selector, password_selector, submit_selector, and form_url_regex in the connector configuration.
password_value String The value to use to populate the password_selector field that you set in the connector configuration.
If you set this parameter, you must also set login_value. You must also set login_selector, password_selector, submit_selector, and form_url_regex in the connector configuration.
auth_user String The user name to use for pages that request authentication. You can use this parameter for Basic, HTTP Digest and NTLM version 2 authentication.
If you set this parameter, you must also set auth_password.
auth_password String The password to use for pages that request authentication. You can use this parameter for Basic, HTTP Digest and NTLM version 2 authentication.
If you set this parameter, you must also set auth_user.

Example Connector Credentials

{
	"login_value": "login",
	"password_value": "password"
}

{
	"auth_user": "login",
	"auth_password": "password"
}
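
The credentials table pairs its parameters: login_value goes with password_value, and auth_user goes with auth_password. The following Python sketch checks that each pair is either fully set or fully absent. The check_credentials function is hypothetical, and it does not verify that the matching selectors exist in the connector configuration.

```python
# Each tuple lists two parameters that must be set together.
PAIRS = [("login_value", "password_value"), ("auth_user", "auth_password")]


def check_credentials(credentials):
    """Return a list of problems found in a credentials dictionary."""
    problems = []
    for first, second in PAIRS:
        if (first in credentials) != (second in credentials):
            problems.append(first + " and " + second + " must be set together")
    return problems


print(check_credentials({"login_value": "login", "password_value": "password"}))
print(check_credentials({"auth_user": "login"}))
```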

Web Onsite Connector Credentials Policy

This section outlines the options that you can use to set the credentials policy for the connector. The credentials policy options define when the system can decrypt credentials. You can use these parameters in the JSON object that you pass to the credentials_policy parameter in the Create Connector API. The credentials_policy parameter is required if you have set credentials.

The credentials policy controls how Haven OnDemand decrypts the credentials that the connector uses to access the repository. When you start the onsite connector, it retrieves the connector configuration, including any credentials it requires, from Haven OnDemand. You can use the credentials_policy to determine how long the connector can use these credentials. After this time, you must renew the policy with the Update Connector API.

Note: All the options are case sensitive.

Required Parameters

Parameter Type Description
notification_email String The email address to which to send information about connector activity.

Optional Parameters

Parameter Type Description Default
key_expiration String The duration that the credentials policy is valid for. When the key expires, the Haven OnDemand key management service returns an error stating that the policy has expired. For a list of available date formats, see Date Formats for Parameters. Default: 3 months
notification_email_frequency Enum The frequency to use to send information about connector activity to the notification_email address. You can use the following values:
  • always. Always send email notifications for all connector activity.
  • on_decrypt. Send email notifications only when an attempt to decrypt connector credentials occurs.
  • on_failure. Send email notifications only when a failure occurs when using connector credentials.
  • never. Never send email notifications.
Default: on_decrypt

Important: When you use the key_expiration field, if the credentials policy expires, the connector fails to retrieve the configuration necessary to run the connector. In this case, you must renew the credentials by using the Update Connector API to update the connector with your credentials, and then restarting the connector.

Example Credentials Policy

{
	"notification_email": "test@test.com",
	"notification_email_frequency": "always",
	"key_expiration": "19/06/2015 11:25:00"
}
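
Because notification_email_frequency accepts only four values, a small builder can catch a misspelled value before the request is sent. The following Python sketch is illustrative only. The credentials_policy function is hypothetical, and it passes key_expiration through as a string without checking the date format, because the accepted formats are listed in a separate reference.

```python
FREQUENCIES = {"always", "on_decrypt", "on_failure", "never"}


def credentials_policy(notification_email, frequency="on_decrypt",
                       key_expiration=None):
    """Build a credentials policy object with a valid frequency value."""
    if frequency not in FREQUENCIES:
        raise ValueError("unknown notification_email_frequency: " + frequency)
    policy = {
        "notification_email": notification_email,
        "notification_email_frequency": frequency,
    }
    # key_expiration is optional and is not validated here.
    if key_expiration is not None:
        policy["key_expiration"] = key_expiration
    return policy


print(credentials_policy("test@test.com", "always", "19/06/2015 11:25:00"))
```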

Web Onsite Connector Limits

The web_onsite flavor Connector has the following limits:

Config

Property Max Limit
max_pages 1000000
max_links_per_page 10000000
page_timeout 604800

Schedule

Property Max Limit
interval 31536000
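
The limits above can be kept in a lookup table and compared against your settings. The following Python sketch reports any setting that exceeds its documented maximum. The over_limit function and the two dictionary names are hypothetical.

```python
# Maximum values copied from the limits tables.
CONFIG_LIMITS = {
    "max_pages": 1000000,
    "max_links_per_page": 10000000,
    "page_timeout": 604800,
}
SCHEDULE_LIMITS = {"interval": 31536000}


def over_limit(settings, limits):
    """Return the settings whose values exceed the documented maximum."""
    return {
        name: value
        for name, value in settings.items()
        if name in limits and value > limits[name]
    }


print(over_limit({"max_pages": 2000000, "page_timeout": 120}, CONFIG_LIMITS))
print(over_limit({"interval": 21600}, SCHEDULE_LIMITS))
```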

Static Resource Unit Cost

It costs 1 static resource unit to create a web_onsite flavor connector.