Graph Analytics
An introduction to graph analytics and the Haven OnDemand Graph Analysis APIs

Introduction to Graph Analysis

Graphs allow you to create and explore models of relationships between entities. A graph is a set of edges between nodes, where a node is an entity (such as a concept, place, or web page), and an edge is a connection between two nodes.

Haven OnDemand provides a public graph data index, which is based on the English Wikipedia data set.

Wikipedia provides a useful source of general knowledge, both in terms of the content on each page, and through the hyperlinks between pages. These links show how topics in Wikipedia are related to one another. The Haven OnDemand Graph APIs represent these connections as a graph, which you can use to explore the knowledge base in a connection-oriented approach.

Each Wikipedia page is a node in the graph. The edges of the graph are the hyperlinks between pages. Nodes have a unique name (based on the page title), and a unique ID. Each edge has a direction, so that a link from page A to page B is different to a link from B to A.

The following diagram shows a simplified graph based on the English Wikipedia knowledge map.

You can get an idea of how much information the graph contains by counting the nodes and edges. For example, the graph above has seven nodes and nine edges. The Summarize Graph API returns this information for a graph.

The Summarize Graph API also returns information about any node and edge attributes (that is, information associated with the nodes and edges). In the Wikipedia graph dataset, each node has an in-degree attribute, which shows the number of edges that point to the node. Each edge also has a weight attribute, which represents the conceptual distance between the pair of pages that it links. A low weight indicates that the pages are very similar (an edge between identical pages has zero weight). The Graph APIs automatically calculate the weight for the edges according to an internal weighting algorithm.

Evaluate Concepts

You can use the Get Nodes API to return a list of the nodes in the graph. By default, the nodes are sorted by the in-degree for the node (that is, the number of edges that point to the specified node). This API allows you to find the most important nodes in the graph. For Wikipedia, the number of pages that link to a particular page might be an indication of its popularity, or the importance of its concepts.

In the example graph, Italy, Physics, and France are the most important, all with two pages that link to them. United Kingdom is a lone node, with no links to or from it.

Evaluate Relationships

The Get Neighbors API allows you to find nodes that a source node links to, or that link to a particular target node. You can also use the Get Common Neighbors API to find common neighbors for two or more nodes. For Wikipedia, the links that users add to and from a page can provide useful context. You can find related concepts by finding pages that two or more related pages link to.

In the example graph, the out-neighbors of France (that is, the pages that France links to), correspond to the geographical neighbors of the country. You might consider Italy a more important neighbor, because more pages link to it than Germany. France and Italy also share a common out-neighbor (Germany), while Germany does not share any common out-neighbors with France.

The Suggest Links API allows you to find nodes in the graph that are likely to be related to a specified node, but that do not have a direct link to it in the graph. This might indicate related concepts.

In the example graph, Italy is not directly related to United States, but they are connected and related by Germany. This might reflect global relationships.

Entity Disambiguation

The Get Shortest Path API allows you to find a path of links between two specified nodes.

There are different ways to calculate the shortest path, and the measure parameter allows you to select different methods. The weighted measure finds the path that follows the edges with the lowest weight. The weight of an edge indicates the similarity of the two nodes that it links, and an edge between very similar nodes has a low weight. The uniform measure treats all edges as equal, and finds the path that uses the fewest edges. The inverseweighted measure allows you to find the path that follows the edges with the highest weight (that is, a path between the least similar pages).

You can use the shortest path information to disambiguate between potential meanings of an ambiguous term, by finding the shortest path between the potential meanings and other terms that provide context, and avoiding incidental connections.

Another possible use is to identify anomalies. For example, you might want to flag documents in another dataset that contain two concepts that have a long path between them in the Wikipedia graph.

Other Applications

The Get SubGraph API allows you to find a subset of the total graph, by providing the nodes that you want to include. It returns the nodes you provide, and any edges between them. You can use this in combination with the other APIs to simplify the whole Wikipedia dataset into smaller knowledge maps.

You can use the Graph APIs to provide context to query results, for example, by providing several of the neighboring pages for a result page. You can also create simple games involving finding routes through wikipedia pages, such as degrees-of-separation games, and comparing the graph route between two places to a map route.