Predictive Analytics
An introduction to predictive analytics and the Haven OnDemand Prediction APIs.

Introduction to Predictive Analytics

Predictive analytics is the use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data.

Predictive analytics works by analyzing historical data that represents the relevant patterns and outcomes, and uses those patterns to statistically calculate the likelihood of a specific result. This learning takes place during the predictor training session, which produces a mathematical model (the prediction model) that you can apply to similar data to predict the probable outcomes.

For example, you might have a set of data that lists various properties and characteristics of different plants, along with the name of the species. You can train a prediction model by using this data, and use it to identify the most likely species of plant from a set of the same properties where the species is not known.

You can also use the same prediction model in reverse: to find the factors most likely to produce a given result. For example, given suitable and sufficient data, you can identify which conditions correlate most strongly with votes for a candidate, sales from a web site, wins in a board game, survival of epidemics, stress resistance in buildings, or fuel efficiency in vehicles.

A limitation of machine learning is that it can establish only correlations between conditions and outcomes, not causal relationships. It is up to the data analyst to look for the deeper associations between them. Nevertheless, machine-generated recommendations can be a valuable tool for optimizing outcomes in sets of data too large to process in traditional ways.

Overview of Haven OnDemand Machine Learning APIs

The Haven OnDemand Prediction APIs provide an easy way to set up predictive analytics. The family includes:

  • The Train Prediction API: to train a prediction model that detects correlations between sets of conditions and sets of outcomes in your data.
  • The Predict API: to predict outcomes from conditions.
  • The Recommend API: to find the conditions most likely to produce outcomes that you specify.
  • The Get Prediction Model Details API: to return details about trained prediction models.
  • The Delete Prediction Model API: to delete a specified prediction model.
  • In our Labs, you can find previews of two more Machine Learning APIs:
    • Anomaly Detection: trawls your data to see if there are any significant outliers in it.
    • Trend Analysis: detects the trends that differentiate two sets of similarly formatted data, for example, sales between two successive months or years.

The Train Prediction Workflow

Note: The following sections refer to the version 2 prediction APIs.

  1. Validate the data. The API ensures that the data is structured correctly, and that it has the correct data types. In particular, it checks that:
    1. there are no unsupported characters in the data.
    2. the field content types are correct (NUMERIC or STRING).
    3. each row contains the correct number of values.
  2. Explore the data. To prepare the data for prediction, the API derives additional metadata, to fit with the requirements of the algorithm. The API searches the training data for types, categories, duplications, columns that are not informative, and so on. The API stores this information for later stages of the training.
  3. Split the data. The API splits the initial dataset randomly into two sets. It uses 80% of the data for training, and 20% for testing the models.
  4. Prepare the data. The API prepares the data for training, by normalizing it, converting categories to numbers, filtering duplicate columns, and so on. Every algorithm has specific requirements, and the API prepares the data for each algorithm.
  5. Train the models. The API uses the training data and runs every prediction algorithm multiple times with different parameters. It passes all the resulting models to the testing stage.
  6. Test the models. The API runs the test data with each model, and compares the model result with the actual test values to determine the most accurate statistical method.
  7. Publish the model. The API selects the model with the most accurate results, and publishes it as a prediction model under the model name that you chose when you sent the Train Prediction API request.

The prediction model is available as soon as the training session is finished.

Classification and Regression Models

Haven OnDemand's Train Prediction and Predict APIs support both classification and regression prediction models. The Recommend API supports only classification models.

Classification Models

Use classification prediction models to predict on categories of distinct, limited, non-continuous values, such as professions, voting tendencies or, in the example used here, income brackets. With Haven OnDemand's APIs, predictions on any field of STRING type necessarily use classification models.

To successfully train a classification type model, the prediction_field that you specify must contain categorical values.

Note: The number of records in the entire training dataset must be equal to or greater than the square of the number of different categories in this field. For example, if you have ten possible values in the prediction_field, you need a training dataset containing at least 100 records to train a prediction model.

Regression Models

Use regression prediction models to predict on continuous numeric values, where fractions are allowed, for example, average life expectancy. You can use regression models only on field values of NUMERIC type. However, note that not all NUMERIC fields require a regression approach. For example, numeric dress and shoe sizes are discontinuous categories.

Haven OnDemand supplies a sample dataset for you to try the regression algorithms. It tracks high-school student grades based on a wide range of criteria, including parental education levels, pastimes, internet access, alcohol consumption, and relationships. The grades are continuous values between 0 and 20 [1].

Selecting the "Best" Algorithm

Depending on whether you choose a classification or regression model, Haven OnDemand tests a range of algorithms and selects the one that gives the best results. But what is "best"? Statistics offers several measures for determining this, and Haven OnDemand lets you select which one to use from a range of recognized measures for either classification or regression models.

Example Use Case

The following sections demonstrate how to train a prediction model and use it to make predictions and recommendations on new data of the same structure. We use a training dataset of over 32,000 records compiled from census data (see Training Data Structure). It classifies people into income groups of "over 50K" and "under 50K". The goal is to predict the income group of a previously unclassified person, and recommend the changes needed to return a specified result for a set of input data.

Train a Predictor

To use the Train Prediction API to train a prediction model, you must provide the dataset, a name for the model, and the name of the field in your dataset containing the values on which you want to make predictions.

You send the Train Prediction API with the following parameters:

  • file: The CSV or JSON file that contains your training data. Select this option to upload a file from your file system in a browser.
    You can also provide this as:
    • a url pointing to the file. (This is the option set up on the Try It page.)
    • a Haven OnDemand object store reference, if it is already present in a store on your account.
    • a json object, to submit JSON data directly.
  • prediction_field: The column in the training dataset that contains the value that you want to predict. In the sample training dataset, this is the class column, which contains the information about the income group.
  • model_name: The name of the model that you want to create. It must be unique.
  • predictor_type: The type of algorithms to use for your data (classification or regression). If you do not specify this parameter, the API trains a classification model.
  • fields: When your data is in JSON format, you must also set the fields parameter to a JSON object that describes the fields and data types in your dataset.
  • selection_strategy: The measure to use for determining which results are best. For classification models, you can select accuracy (the default value), precision, recall, or f_measure. For regression models, you can choose from mean_square_error (the default value), root_mean_square_error, mean_absolute_error, or r_square.

The following example creates a classification prediction model with the name 50KServiceJSON. You can try this yourself via the Try It page of the Train Prediction API:

POST --form "url=https://www.havenondemand.com/sample-content/prediction/v2/50KTrainV2API.json" --form "prediction_field=class" --form "model_name=50KServiceJSON" --form "predictor_type=classification" --form "fields={"fields":[{"name":"age","type":"NUMERIC"},{"name":"workclass","type":"STRING"},{"name":"fnlwgt","type":"NUMERIC"},{"name":"education","type":"STRING"},{"name":"education-num","type":"NUMERIC"},{"name":"marital-status","type":"STRING"},{"name":"occupation","type":"STRING"},{"name":"relationship","type":"STRING"},{"name":"race","type":"STRING"},{"name":"sex","type":"STRING"},{"name":"capital-gain","type":"NUMERIC"},{"name":"capital-loss","type":"NUMERIC"},{"name":"hours-per-week","type":"NUMERIC"},{"name":"native-country","type":"STRING"},{"name":"class","type":"STRING"}]}" --form "selection_strategy=accuracy" --form "apikey=your_apikey" https://api.havenondemand.com/1/api/async/trainpredictor/v2	

This asynchronous request sends the data and parameters and returns a job_id, which you can use to find the status of the training job. For more information about using the asynchronous API, see Synchronous and Asynchronous API.

Note: Haven OnDemand recommends that you use the asynchronous version of the Train Prediction API, because the training process can take some time.
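If you prefer to script the call, the following minimal sketch (Python with the requests library) submits the training job and polls until it completes. The /1/job/status/{job_id} endpoint and the jobID and status response keys are assumed from Haven OnDemand's Synchronous and Asynchronous API documentation, so verify them for your account:

# A sketch of submitting the asynchronous Train Prediction request and
# polling the job status. Endpoint and key names for job polling are
# assumptions based on the documented asynchronous API pattern.
import json
import time

import requests

APIKEY = "your_apikey"  # placeholder
BASE = "https://api.havenondemand.com/1"

# Field definitions, in the same order as in the example request above.
field_types = {
    "age": "NUMERIC", "workclass": "STRING", "fnlwgt": "NUMERIC",
    "education": "STRING", "education-num": "NUMERIC",
    "marital-status": "STRING", "occupation": "STRING",
    "relationship": "STRING", "race": "STRING", "sex": "STRING",
    "capital-gain": "NUMERIC", "capital-loss": "NUMERIC",
    "hours-per-week": "NUMERIC", "native-country": "STRING",
    "class": "STRING",
}
fields_param = json.dumps(
    {"fields": [{"name": n, "type": t} for n, t in field_types.items()]})

# Submit the training job; the response contains a job ID.
job = requests.post(BASE + "/api/async/trainpredictor/v2", data={
    "url": "https://www.havenondemand.com/sample-content/prediction/v2/50KTrainV2API.json",
    "prediction_field": "class",
    "model_name": "50KServiceJSON",
    "predictor_type": "classification",
    "selection_strategy": "accuracy",
    "fields": fields_param,
    "apikey": APIKEY,
}).json()

# Poll until the training job finishes; training can take several minutes.
while True:
    status = requests.get(BASE + "/job/status/" + job["jobID"],
                          params={"apikey": APIKEY}).json()
    if status.get("status") in ("finished", "failed"):
        break
    time.sleep(10)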

After the training job finishes, you can use the prediction model with the Predict and Recommend APIs.

Training Data Structure

For training and prediction, the data that you use must be in CSV or JSON format. For recommendation, only JSON is supported.

The sample training data is from a public dataset extracted from the 1994 US Census database, known as the "Adult" dataset. It classifies people into income groups of "over 50K" and "under 50K" [2].

You can download the sample datasets provided for classification model training if you want to review them before you run the API.

The dataset classifies people according to the following properties:

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous. The number of times this combination of properties occurs in the total database.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous. Education in number of years.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous. Gains from invested monies.
  • capital-loss: continuous. Losses from invested monies.
  • hours-per-week: continuous. Hours worked.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • class: >50K or <=50K. This is the column used for training.

CSV

To provide training data in CSV format, the file must contain text in rows. The data must have a header row, a data type row, and data rows. On each row, the values must be separated by commas. In addition:

  • For training data, one of the fields must be the class label.
  • The following characters must not appear in the fields and values: , ; \ \n \r.
  • The data type row can contain only the following types: STRING or NUMERIC.
  • Space characters must be present only as part of the text.

The following example shows the start of the CSV file for the training data (a sketch for checking these rules locally follows the example):

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
NUMERIC,STRING,NUMERIC,STRING,NUMERIC,STRING,STRING,STRING,STRING,STRING,NUMERIC,NUMERIC,NUMERIC,STRING,STRING
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
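
Before you upload a CSV file, you can check it against these rules yourself. The following is an illustrative sketch (written in Python; it is not part of the API) that validates the header row, the type row, the column counts, and the forbidden characters:

# Validate a CSV training file against the rules above: a header row,
# a type row limited to NUMERIC/STRING, a consistent number of values
# per row, and no ; or \ characters inside values (a stray comma shows
# up as a wrong column count).
def validate_training_csv(path):
    with open(path, newline="") as f:
        rows = [line.rstrip("\r\n").split(",") for line in f if line.strip()]
    header, types, data = rows[0], rows[1], rows[2:]
    if len(types) != len(header):
        raise ValueError("type row does not match the header row")
    if any(t not in ("NUMERIC", "STRING") for t in types):
        raise ValueError("type row may contain only NUMERIC or STRING")
    for lineno, row in enumerate(data, start=3):
        if len(row) != len(header):
            raise ValueError("line %d: wrong number of values" % lineno)
        for value in row:
            if ";" in value or "\\" in value:
                raise ValueError("line %d: forbidden character" % lineno)
    return header, types

# Example (the file name is a placeholder):
# validate_training_csv("training.csv")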

JSON

To provide the training data in JSON format, the file must contain a JSON object, or you can submit the JSON object directly in the json parameter. The JSON must be in the following format:

{ 
"dataset": [
	{
		"field1": "stringvalue",
		"field2": value
	},
	{
		"field1": "stringvalue",
		"field2": value
	},
	{
		"field1": "stringvalue",
		"field2": value
	}
	...
]
}

Where field1, field2, and so on are your data fields (which can have any name), each with its associated value.

You must also set the fields parameter to provide information about the fields and data types in your data. You set fields to a JSON object in the following format:

{
"fields": [
	{
		"name": "field_name",
		"type": "field_type"
	},
	{
		"name": "field_name",
		"type": "field_type"
	}
	...
]
}
  • For the training data, one of the fields must be the class label.
  • The following characters must not appear in the fields and values: , ; \ \n \r.
  • All backslashes in the fields and values must be escaped as a double backslash (\\).
  • The type field for the fields array can contain only the following types: NUMERIC or STRING.
  • Space characters must be present only as part of the text.

The following example shows the start of the JSON file for the training data:

{
"dataset": [
 {
   "age":39,
   "workclass":"State-gov",
   "fnlwgt":77516,
   "education":"Bachelors",
   "education-num":13,
   "marital-status":"Never-married",
   "occupation":"Adm-clerical",
   "relationship":"Not-in-family",
   "race":"White",
   "sex":"Male",
   "capital-gain":2174,
   "capital-loss":0,
   "hours-per-week":40,
   "native-country":"United-States",
   "class":"<=50K"
 },
 { 
   "age":50,
   "workclass":"Self-emp-not-inc",
   "fnlwgt":83311,
   "education":"Bachelors",
   "education-num":13,
   "marital-status":"Married-civ-spouse",
   "occupation":"Exec-managerial",
   "relationship":"Husband",
   "race":"White",
   "sex":"Male",
   "capital-gain":0,
   "capital-loss":0,
   "hours-per-week":13,
   "native-country":"United-States",
   "class":"<=50K"
 },
 {
   "age":38,
   "workclass":"Private",
   "fnlwgt":215646,
   "education":"HS-grad",
   "education-num":9,
   "marital-status":"Divorced",
   "occupation":"Handlers-cleaners",
   "relationship":"Not-in-family",
   "race":"White",
   "sex":"Male",
   "capital-gain":0,
   "capital-loss":0,
   "hours-per-week":40,
   "native-country":"United-States",
   "class":"<=50K"
 },
...	

The following example shows the fields object for this data (a sketch for generating both objects programmatically follows it).

{
"fields": [
	{
		"name": "age",
		"type": "NUMERIC"
	},
	{
		"name": "workclass",
		"type": "STRING"
	},
	{
		"name": "fnlwgt",
		"type": "NUMERIC"
	},
	{
		"name": "education",
		"type": "STRING"
	},
	{
		"name": "education-num",
		"type": "NUMERIC"
	},
	{
		"name": "marital-status",
		"type": "STRING"
	},
	{
		"name": "occupation",
		"type": "STRING"
	},
	{
		"name": "relationship",
		"type": "STRING"
	},
	{
		"name": "race",
		"type": "STRING"
	},
	{
		"name": "sex",
		"type": "STRING"
	},
	{
		"name": "capital-gain",
		"type": "NUMERIC"
	},
	{
		"name": "capital-loss",
		"type": "NUMERIC"
	},
	{
		"name": "hours-per-week",
		"type": "NUMERIC"
	},
	{
		"name": "native-country",
		"type": "STRING"
	},
	{
		"name": "class",
		"type": "STRING"
	}
	]
}
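
If you assemble the training data programmatically, you can derive the fields object from the records instead of writing it by hand. The following is an illustrative sketch (the record shown is truncated for brevity; a real record needs every field) that infers NUMERIC or STRING from the Python value types:

# Build the dataset object and a matching fields object from records,
# inferring NUMERIC for int/float values and STRING otherwise.
import json

def build_payload(records):
    first = records[0]
    fields = {"fields": [
        {"name": name,
         "type": "NUMERIC" if isinstance(value, (int, float)) else "STRING"}
        for name, value in first.items()
    ]}
    return json.dumps({"dataset": records}), json.dumps(fields)

records = [{"age": 39, "workclass": "State-gov", "class": "<=50K"}]
json_param, fields_param = build_payload(records)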

Train an Optimal Model

To get the best prediction model and the best final results, use the following best practices when you create your training data.

  • When you structure your training data:
    • remove any duplicate fields.
    • use as many records as possible.
    • use as many features as possible.
    • use the same case for categories (the model is case sensitive).
  • When you choose the data for your use case:
    • choose numeric and categorical data where possible.
    • avoid empty values if possible.
    • use epoch time instead of other date formats (see the sketch after this list).
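
For the last point, a conversion like the following (a Python sketch; the date value is illustrative) turns a calendar date into epoch seconds:

# Convert an illustrative UTC date to epoch seconds.
from datetime import datetime, timezone

epoch = int(datetime(2016, 3, 14, tzinfo=timezone.utc).timestamp())  # 1457913600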

Run the Prediction

The training example above produced the 50KServiceJSON prediction model, which you can use to predict and classify new data. You can try this yourself via the Try It page of the Predict API. You can also download the sample data used in the example, if you want to have a look at it.

To predict data, you use the Predict API, with the following parameters:

  • file: The CSV or JSON file that contains the data for which you want to predict values. This data must have the same structure as your training data. In the following example, this file is called 50KPredictV2API.json. You can also submit the file via a url (as in our example), a store object reference, or as a json object.
  • model_name: The name of the prediction model to use, in this example, 50KServiceJSON, the service created with the JSON file.
  • format: The format in which you want to return the results (JSON or CSV).
  • fields: When your data is in JSON format, you must also set the fields parameter to a JSON object that describes the fields and data types in your dataset. The field definitions must be identical to those used to train the prediction model.

For example:

POST --form "url=https://www.havenondemand.com/sample-content/prediction/v2/50KPredictV2API.json" --form "model_name=50KServiceJSON" --form "format=json" --form "fields={"fields":[{...}]}" --form "apikey=your_apikey" https://api.havenondemand.com/1/api/sync/predict/v2

This request returns the results in JSON format. They contain the original data, with an additional column that contains the predicted income group. For classification models, the API also returns the confidence of each prediction.

You can also retrieve the result in CSV format.
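
If you script the call, the synchronous request looks like the following sketch (Python with the requests library; fields.json is a placeholder for the same fields definition used to train the model):

# A sketch of the synchronous Predict call.
import requests

with open("fields.json") as f:  # placeholder for the training fields JSON
    fields_param = f.read()

resp = requests.post("https://api.havenondemand.com/1/api/sync/predict/v2", data={
    "url": "https://www.havenondemand.com/sample-content/prediction/v2/50KPredictV2API.json",
    "model_name": "50KServiceJSON",
    "format": "json",
    "fields": fields_param,
    "apikey": "your_apikey",  # placeholder
})
resp.raise_for_status()
print(resp.json())  # the original rows plus the predicted class and confidence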

Find Recommendations

After you have created your prediction model, you can use the Recommend API to detect what changes would be necessary to a set of data to make it deliver a specific outcome.

The Recommend API must be used in conjunction with the Train Prediction API, and on identically structured data. It takes the name of a prediction model created with the Train Prediction API as a required input parameter.

Notes:

  • The Recommend API works only with JSON input files.
  • The Recommend API works only with classification prediction models, not regression prediction models.

Example Use Case

For our example we assume that you have already created the 50KServiceJSON prediction model using the JSON dataset.

You can now try the Recommend API from its Try It page. If you want to examine the sample data used in the example, you can download it: 50KRecommendV2API.json.

You can submit any number of records to the API. For records whose predicted value is already equal to the required value, the API returns empty recommendations. Our sample contains a dataset consisting of a single row of data:

{
	"dataset": [
		{"age":"42","workclass":"State-gov","fnlwgt":"218948","education":"Doctorate","education-num":"16","marital-status":"Divorced","occupation":"Prof-specialty","relationship":"Unmarried","race":"Black","sex":"Female","capital-gain":"0","capital-loss":"0","hours-per-week":"36","native-country":"Jamaica","class":""}
	]
}

In addition to the input data, the Recommend API requires the following other parameters:

  • model_name: The name of a previously created prediction model. In this instance, we use the model created earlier from the JSON dataset, 50KServiceJSON.
  • required_label: The result that we want the recommendations to achieve; in this instance, >50K.
  • modifiable_features: The fields on which we allow recommendations to be made. Select fields that are pertinent to the recommendation. For this example, we have chosen education, marital-status, occupation, and hours-per-week.
  • fields: The JSON object listing all the fields in the dataset. This must contain field definitions identical to those used to create the model.
  • recommendations_amount: (Optional) The number of recommendations to return.

For example:

POST --form "url=https://www.havenondemand.com/sample-content/prediction/v2/50KRecommendV2API.json" --form "model_name=50KServiceJSON" --form "required_label=>50K" --form "modifiable_features=education" --form "modifiable_features=marital-status" --form "modifiable_features=occupation" --form "modifiable_features=hours-per-week" --form "recommendations_amount=1" --form "fields={"fields":[{...}]}" --form "apikey=your_apikey" https://api.havenondemand.com/1/api/sync/recommend/v2

Results:

{
 "dataset": [
  {
   "row": {
 	"age": 42,
 	"workclass": "State-gov",
 	"fnlwgt": 218948,
 	"education": "Doctorate",
 	"marital-status": "Divorced",
 	"occupation": "Prof-specialty",
 	"relationship": "Unmarried",
 	"race": "Black",
 	"sex": "Female",
 	"capital-gain": 0,
 	"capital-loss": 0,
 	"hours-per-week": 36,
 	"native-country": "Jamaica",
 	"class": "",
 	"prediction": "<=50K"
   },
   "recommendations": [
 	{
 	  "recommendation": {
 		"age": 42,
 		"workclass": "State-gov",
 		"fnlwgt": 218948,
 		"education": "Prof-school",
 		"marital-status": "Married-civ-spouse",
 		"occupation": "Prof-specialty",
 		"relationship": "Unmarried",
 		"race": "Black",
 		"sex": "Female",
 		"capital-gain": 0,
 		"capital-loss": 0,
 		"hours-per-week": 80,
 		"native-country": "Jamaica",
 		"class": "",
 		"prediction": ">50K"
 	  },
 	  "prediction": ">50K",
 	  "distance": 4.898979485566356,
 	  "confidence": 0.752913493726639
 	}
   ]
  }
 ]
}

Discussion

Haven OnDemand displays the results as a separate JSON array directly under the data on which it is recommending. The specified modifiable fields appear with their modified values. Additionally, three new items beneath the recommendation (see the parsing sketch after this list) show:

  • prediction: The predicted outcome if the conditions in the recommendation are met.
  • distance: The degree of separation between the data as submitted and the recommendation. The smaller the value, the closer the recommended record is to the submitted one.
  • confidence: A value between 0 and 1 representing the likelihood that the prediction would be realized if the recommendations were followed. In the example, this is about 75%.
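
A small sketch (Python; it assumes the response has been parsed into a dictionary with the layout shown above) extracts the changed fields and these three items:

# Walk a Recommend response: for each input row, print the fields that
# each recommendation changed, plus its prediction, distance, and
# confidence.
def changed_fields(row, rec):
    return {k: (row[k], rec[k]) for k in row if k in rec and row[k] != rec[k]}

def summarize(results):
    for item in results["dataset"]:
        row = item["row"]
        for r in item["recommendations"]:
            print(changed_fields(row, r["recommendation"]),
                  r["prediction"], r["confidence"], r["distance"])

# Example: summarize(response.json()) on the Recommend API response.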

The findings can be more or less significant, depending on how you prepare your data and select your fields. For the modifiable fields chosen, the API recommends a change of marital status from Divorced to Married-civ-spouse, a preference for professional schooling over an academic doctorate, and that the person should work 80 hours per week.

Arguably, you can use any field, depending on what you are trying to achieve. If you are using the power of machine learning to discover correlations that lie outside human preconceptions, you might want to include all fields and see which ones influence the result most. Terms such as recommend or modifiable are just labels and should not mislead you. For example, attributes that cannot be modified in the real world, such as sex, race, or native-country, still yield valid results for research.

On the other hand, your data may require cleaning and normalization. In our sample set, there are redundancies and inconsistencies. If relationship contains values such as Husband and Wife, it is not independent of sex. Education in number of years (education-num) is arguably redundant with the education degree, and so forth. These considerations lie in the domain of the data scientist.


[1] P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (Eds.), Proceedings of the 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008. EUROSIS. ISBN 978-9077381-39-7.

[2] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.