Introduction to Anomaly Detection

Anomaly detection is the extraction of the most anomalous records (rows) from a given dataset. It uses a new anomaly scoring algorithm developed at Hewlett Packard Labs.

Anomaly detection is particularly suited for cases where the user seeks exceptional (anomalous) rows that correspond to high potential risk (or gain), but where no explicit rules are known for identifying such records. Anomaly detection looks for rows that contain exceptionally rare values or combinations of values, assigns them an anomaly score accordingly, and highlights the top-ranking anomalous rows.

For example, a banker might look for unusual credit applications in a dataset that lists attributes of the customers applying for a loan, such as financial status and history, personal status, and employment status. An example of an anomalous credit application, with several attribute values or combinations that are rare in credit applications, would be a customer who is unemployed, aged above 65, and already has three existing credits at the bank, yet who also has large sums in a checking account and a savings account at the same bank.

The Anomaly Detection API provides an easy way to highlight and extract the most unusual records.

Anomaly Detection API

This version of the Anomaly Detection API is designed for categorical (qualitative) datasets. A categorical dataset is a table where each column is an attribute that can take discrete values. Categorical datasets are common in the fields of network security, fraud detection, credit risk evaluation, social sciences, and so on.

The Anomaly Detection API uses statistics and data modeling to analyze the data.

The input data is a CSV (comma-separated values) table, where each row is one record. The assumption is that all columns in the data contain categorical attributes. Numerical values are treated as symbols (strings).
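
To reflect this convention when preparing data, every column can be loaded as a string. A minimal sketch, assuming the pandas library and a hypothetical file name:

import pandas as pd

# Read every column as a string so that numerical values
# (e.g. counts or amounts) are treated as categorical symbols.
data = pd.read_csv("credit_applications.csv", dtype=str)
print(data.dtypes)  # every column is reported as 'object' (string)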

  • Column features. The API analyzes two types of column features: single columns and column pairs. The value of a column pair in a particular row is the combination (tuple) of values that appear in each of the individual columns in that row.
  • Anomaly model. For each of the column features, the API calculates a statistical anomaly model that assigns an anomaly score to each of the unique values of that column feature, based on their repetition count. High anomaly scores are assigned to unique values with counts considerably below the counts of the vast majority of the other unique values. The anomaly scores for each column feature are normalized to the range 0-1, so that they are comparable between different column features.
  • Row anomaly score. For each row (record), the API assigns a row anomaly score that aggregates all the anomaly scores of the values and value combinations that appear in that row. Unlike the anomaly scores for the column features, the row anomaly scores are not normalized, and the range of the score depends on the number of columns analyzed (a scoring sketch follows this list).
  • Row anomaly description. For each row (record) that has a non-zero row anomaly score, the API states its index and lists the significantly contributing column features, along with their type (single or combination), value, and anomaly score.
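
The scoring algorithm itself was developed at Hewlett Packard Labs and is not described here; the following Python sketch only illustrates the general shape of the computation under simplified assumptions (a count-based per-feature score scaled into the 0-1 range, and a plain sum for the row score). The file name is hypothetical.

import csv
from collections import Counter
from itertools import combinations

# Hypothetical categorical CSV; every value is kept as a string.
with open("credit_applications.csv", newline="") as f:
    rows = list(csv.DictReader(f))

columns = list(rows[0].keys())
# Column features: each single column plus every column pair.
features = [(c,) for c in columns] + list(combinations(columns, 2))

# Per-feature anomaly model: rarer unique values get higher scores
# (a simple count-based proxy, not the actual Hewlett Packard Labs model).
models = {}
for feature in features:
    counts = Counter(tuple(row[c] for c in feature) for row in rows)
    max_count = max(counts.values())
    models[feature] = {value: 1.0 - count / max_count
                       for value, count in counts.items()}

# Row anomaly score: the (unnormalized) sum of the per-feature scores
# of the values and value combinations appearing in the row.
def row_anomaly_score(row):
    return sum(models[feature][tuple(row[c] for c in feature)]
               for feature in features)

# Rank the rows from most to least anomalous.
ranking = sorted(range(len(rows)),
                 key=lambda i: row_anomaly_score(rows[i]),
                 reverse=True)
print(ranking[:10])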

The Anomaly Detection API uses the following process to analyze the data.

  1. Validate the input. The API ensures that the required input is provided. The API requires either a CSV file, or a URL that points to the CSV-formatted data (an illustrative client request follows this list).
  2. Validate the data. The API ensures that the data is structured correctly, and that it has the correct data types. In particular, it checks that:
    1. there are no unsupported characters in the data.
    2. the field content types are correct (RICH_TEXT, INTEGER, or DOUBLE).
    3. each row contains the correct number of values.
  3. Explore the data. To prepare the data for anomaly detection, the API derives additional metadata to fit the requirements of the algorithm. The API searches the data for columns that have too many unique values to be analyzed as categorical attributes.
  4. Compute the anomaly model for all column features. The API calculates an anomaly model for each column (or column pair). The model assigns an anomaly score to each of the unique values (or unique value combinations).
  5. Compute row anomaly scores. For each row, the API applies the anomaly models for each column feature to the values and value combinations found in that row. The API then combines the anomaly scores for each column feature into a row anomaly score.
  6. Return the top anomalous rows. The API returns the row anomaly descriptions for the top row anomaly scores.
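
The exact request format depends on the deployment; the sketch below assumes a hypothetical endpoint URL, API key parameter, and field names, and simply posts a CSV file and prints the JSON response:

import requests

# Hypothetical endpoint and parameter names -- consult the API reference
# for the actual URL, authentication scheme, and field names.
ENDPOINT = "https://api.example.com/anomalydetection/v1"

with open("credit_applications.csv", "rb") as f:
    response = requests.post(
        ENDPOINT,
        files={"file": f},               # alternatively, pass a "url" parameter
        params={"apikey": "YOUR_API_KEY"},
    )

response.raise_for_status()
print(response.json())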

Example Use Case

The example dataset is derived from the publicly available German Credit dataset, where each row describes attributes of the credit request and the loan applicant:

  • Credit_Amount
  • Credit_Duration
  • Purpose
  • Checking_Account
  • Savings_Account
  • Credit_History
  • Num_Existing_Credits
  • Job_Class
  • Foreign_Worker
  • Present_Employment_Duration
  • Present_Residence_Duration
  • Installment_Rate
  • Age_Group
  • Personal_Status_and_Gender
  • Housing
  • Other_Debtors_Guarantors
  • Other_Installment_Plans
  • Property
  • Num_Liables

Some of the continuous numerical attributes in the original dataset were bucketed to facilitate their inclusion in categorical anomaly detection (for example, the original numerical attribute Age_in_years was bucketed into the categorical attribute Age_Group).
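
A bucketing step of this kind can be sketched with pandas; the file name, bucket edges, and labels below are illustrative assumptions, not the ones used to produce the example dataset:

import pandas as pd

data = pd.read_csv("german_credit.csv")  # hypothetical file name

# Bucket the continuous Age_in_years attribute into a categorical
# Age_Group attribute using illustrative 5-year buckets (15-19, ..., 75-79).
edges = list(range(15, 85, 5))
labels = [f"{lo}-{lo + 4}" for lo in edges[:-1]]
data["Age_Group"] = pd.cut(data["Age_in_years"], bins=edges, labels=labels, right=False)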

The top anomaly returned by the API is row 775 in the dataset. The description of the anomaly in JSON format includes the overall anomaly score for the row (row_anomaly_score), and the list of eight anomalies of single values or value combinations in that row, sorted in descending order of their feature anomaly scores:

{
	"anomaly_row": 775,
	"row_anomaly_score": 1.5853007410460305,
	"anomalies": [
		{
			"type": "single",
			"columns": [
				{
					"column": "Job_Class",
					"value": "unemployed/ unskilled non-resident"
				}
			],
			"anomaly_score": 0.33839424703580895
		},
		{
			"type": "single",
			"columns": [
				{
					"column": "Num_Existing_Credits",
					"value": "3"
				}
			],
			"anomaly_score": 0.3348222860027048
		},
		{
			"type": "single",
			"columns": [
				{
					"column": "Age_Group",
					"value": "65-69"
				}
			],
			"anomaly_score": 0.2409956257260984
		},
		{
			"type": "combination",
			"columns": [
				{
					"column": "Present_Employment_Duration",
					"value": "unemployed"
				},
				{
					"column": "Checking_Account",
					"value": "... >= 200 DM / salary assignments for at least 1 year"
				}
			],
			"anomaly_score": 0.19348424608186734
		},
		{
			"type": "single",
			"columns": [
				{
					"column": "Checking_Account",
					"value": "... >= 200 DM / salary assignments for at least 1 year"
				}
			],
			"anomaly_score": 0.15732947418092894
		},
		{
			"type": "single",
			"columns": [
				{
					"column": "Savings_Account",
					"value": "500 <= ... < 1000 DM"
				}
			],
			"anomaly_score": 0.1224697118525933
		},
		{
			"type": "single",
			"columns": [
				{
					"column": "Present_Employment_Duration",
					"value": "unemployed"
				}
			],
			"anomaly_score": 0.11563592441467033
		},
		{
			"type": "single",
			"columns": [
				{
					"column": "Housing",
					"value": "for free"
				}
			],
			"anomaly_score": 0.10626831248579774
		}
	]
}

In particular, the anomalies list includes two single values which are anomalous by themselves (Present_Employment_Duration = "unemployed" and Checking_Account = "... >= 200 DM / salary assignments for at least 1 year"), but which are also anomalous as a combination:

"type": "combination",
	"columns": [
		{
			"column": "Present_Employment_Duration",
			"value": "unemployed"
		},
		{
			"column": "Checking_Account",
			"value": "... >= 200 DM / salary assignments for at least 1 year"
		}
	],
	"anomaly_score": 0.19348424608186734

This means that not only is each of the values relatively rare in its own column, but the combination of the two is rarer still. Indeed, among the relatively few unemployed credit applicants, it is rare to find applicants who also have large sums in their checking account or salary assignments for at least a year.
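
One way to see this directly in the data is to compare the marginal counts of the two values with their joint count. A minimal sketch with pandas (the file name is hypothetical, and the counts depend on the dataset version used):

import pandas as pd

data = pd.read_csv("german_credit.csv", dtype=str)  # hypothetical file name

unemployed = data["Present_Employment_Duration"] == "unemployed"
high_checking = data["Checking_Account"] == "... >= 200 DM / salary assignments for at least 1 year"

# Each value is already rare on its own ...
print(unemployed.sum(), high_checking.sum())
# ... but the combination of the two is rarer still.
print((unemployed & high_checking).sum())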

The anomalies explain why the overall anomaly score of this row is high. The story that emerges is that the applicant is in the age group 65-69 and unemployed (probably retired), and already has three existing credits at the bank, yet has large sums in a checking account and in a savings account at the same bank. A banker assessing credit applications would probably want to investigate such a case to understand why the applicant is seeking more credit. Note that the API found this unusual case without any domain-specific training in banking or credit, and without any prior tagging of attribute values or combinations as unusual.