Introduction to Trend Analysis

Trend Analysis extracts the most important trends from a dataset. It is particularly suited for business intelligence applications, where you want to understand changes between two points in time. Trend Analysis returns a list of the most important trends, and groups the trends into collections of related trends.

Example Use Case

You might have a dataset that describes sales for a French supermarket chain in 2014 and 2015. You can use Trend Analysis to find the most significant trends in the data. Maybe sales of cottage cheese in Lyon were significantly lower in 2015 compared to 2014. Maybe shoppers in Paris were responsible for a much greater proportion of all fruit sales in 2015 compared to 2014.

Trend Analysis first identifies the categories that exist in your dataset, and identifies the players within those categories. In the previous example, "cottage cheese" would be a player in the category "Lyon", but "Lyon" would also be a "player" in the category "cottage cheese". Trend Analysis then aggregates data about your categories. For example, your input data might come from a database that contains raw data describing sales or transactions. Haven OnDemand can aggregate data about each category, for example the number of sales or the value of those sales.

The API then runs trend analysis and returns the most significant categorical trends in your dataset.

Use Trend Analysis

Data Structure

The Trend Analysis API requires an input file in comma-separated values (CSV) format.

The file must contain text in rows. The first row of the file must be a header that provides column names. The second row must describe the data type of each column (STRING or NUMERIC). The following rows must be data rows. On each row, the values must be separated by commas.

  • The following characters must not appear in the fields and values: , ; \ \n \r.
  • The data type row can contain only the following types: STRING, NUMERIC.
  • Space characters must be present only as part of the text.

The API expects the data to include the following columns:

Column Type Number of columns Description
Group Column Exactly one. The group column contains values that are used to split the dataset into two groups that you want to compare. For example, in our sample data the group column could be named "Year" and you could compare the groups 2014 and 2015.
Attribute column Zero or more. These columns specify information about a record in a specific group, for example "product name", "product type", "country", "city", and so on. If you don't specify any attribute columns, the API uses all columns in the table as attribute columns, with the exception of the group column and the aggregation column.
Aggregation column Zero or one. Aggregation columns contain values that you want to aggregate, such as "units sold" or "sale/transaction value". The values in this column are used to identify trends and must be of the type integer or float. Specifying an aggregation column is optional, and if you do not specify one the API considers every record to have the same value (for example, each record might represent one sale or transaction rather than a sale value).

Analysis Process

Trend Analysis performs the following operations:

  1. Validate the request. Your request must contain:

    • A CSV file that contains the data to analyze.
    • The name of the column to use as the group column. For example, "Year".
    • The names of the groups to compare, for example "2014" and "2015".
    • Optionally, the names of the columns to use as attribute columns.
    • Optionally, the name of a column to use as the aggregation column.
    • A list of terms that specify categories that you are interested in.
  2. Validate the data.

    • The data must not contain unsupported characters.
    • The values must match the specified data types.
    • There must be the correct number of values in each row.
  3. Split the data. The data is split into your selected groups, using the values in the specified group column.

  4. Find all attribute combinations. Trend Analysis inspects the specified attribute columns, and finds all possible combinations of values for those columns. Each unique combination of values is considered a player to be analyzed. Using the previous example, "cottage cheese" would be a player in the category "Lyon", and "Paris" would be a player in the category "fruits".

  5. Compute trends. Taking the first group, Trend Analysis aggregates records for each player in each category, and calculates the percentage (share) each player has in the category. This is then repeated for the second group. The trends are the respective differences between the second-group percentages and the first-group percentages.

  6. Compute scores. For each player in each category, Trend Analysis computes a trend score. The trend score is based on the percentage change of the player from the first group to the second, but is also influenced by the category size and category activity, which reduces the chance of returning big trends in insignificant categories.

  7. Cluster the trends into collections. Similar trends are clustered into collections.

  8. Return the top trend collections. The API returns the collections of trends with the highest scores. If you specified a list of terms that you are interested in, the API only returns trends that are associated to one or more of those terms.

Example

Consider the following sample data, which relates to sales by a Canadian store (truncated - the complete data file is used on the Trend Analysis API page):

Year,Sales,Province,Customer Segment,Product Sub-Category
2011,261.54,Nunavut,Small Business,Eldon Base for stackable storage shelf; platinum
2012,10123.02,Barry French,Consumer,1.7 Cubic Foot Compact "Cube" Office Refrigerators
2012,244.57,Barry French,Consumer,Cardinal Slant-D® Ring Binder; Heavy Gauge Vinyl

The sample data contains the following columns:

  • Year
  • Sales (values in Canadian dollars)
  • Province
  • Customer Segment
  • Product Sub-Category

The column "Year" can serve as the group column, so that we can compare sales between 2011 and 2012. The column "Sales" can serve as the aggregation column, so that we can see the trend in the value of sales. The remainder of the columns are attribute columns.

The API might return the following trends:

  1. "Product Sub-Category: Tables" dropped from 16.0% to 9.0% in category "Customer Segment: Corporate",
  2. "Customer Segment: Corporate" grew from 36.4% to 40.0%,
  3. "Product Sub-Category: Office Machines" grew from 12.4% to 15.1%.

Each of these three top trends are associated to additional similar trends in their corresponding collections. For instance, the second trend in the second collection is: "Customer Segment: Corporate" grew from 35.8% to 50.6% in "Province: British Columbia".

Run Trend Analysis

To run trend analysis, send a request to the API with the following parameters:

  • file. A CSV file that contains the data to analyze.
  • groups_column. The name of the column to use as the group column. For example, "Year".
  • main_group. The first group to analyze. This group is used as the baseline. For example, "2014".
  • compared_group. The second group to analyze. This group is compared to the baseline. For example, "2015".
  • columns (optional) The names of the columns to use as attribute columns.
  • aggregation_column (optional). The name of a column to use as the aggregation column.
  • terms (optional) A list of terms that specify categories that you are interested in.

For example:

/1/api/async/analysecategoriestrend/v1?file=data.csv&groups_column=year&main_group=2014&compared_group=2015&aggregation_column=sales&terms=furniture