Microsoft has recently added Graph Analysis capabilities as a preview feature to Azure Data Explorer (ADX). It is an entirely new ‘area’ within ADX, so I went to check it out; this post describes my first impressions, based on a somewhat dummy scenario.
This is the second article in our series about Azure Data Explorer. You can find an introduction to Azure Data Explorer here, if you want to know more about this Azure service and its typical use cases.
Essentially, the Graph Analysis features can extract patterns from many-to-many, hierarchical, or networked relations. Examples mentioned by Microsoft, or that I can think of, are:
- Social network analysis – friend-of-a-friend type queries
- Analysis of network traffic – finding paths between machines/IP (ranges)
- Investment fund value / risk – funds owning parts of funds owning parts of assets with a specific value at a specific moment in time
- Analysis of physical networks like power lines, computer networks, roads, railways etc.
- Getting a complete overview of all tables used by each KQL Function (i.e. functions using functions using data). This is the dummy scenario I used to check out the new graph functions; see below for the description.
From a technical perspective, the graph capabilities can potentially help in any situation that would otherwise require recursive joining to solve (joining the same tables over and over to establish all relationships), as the sketch below illustrates.
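To see why joins fall short: every extra level of indirection requires another self-join, so the maximum depth must be fixed up front. A minimal sketch, assuming an ‘edges’ table with Function and ChildEntity columns like the one built later in this post:

// Resolves dependencies up to exactly two levels deep; a third level
// would need yet another self-join, and so on
edges
| join kind=leftouter (edges) on $left.ChildEntity == $right.Function
| project Function, Level1 = ChildEntity, Level2 = ChildEntity1

With graph analysis, the depth becomes a query parameter instead of a structural assumption baked into the query.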
It is important to understand that ADX has not become a graph database: there are no options to store data in graph-like structures. Instead, you can transform data into a graph representation during query execution to answer graph-like questions. Conceptually this is the same as using the ADX time series representation via ‘make-series’ to do things like anomaly detection and forecasting. At the time of writing ADX only has four graph operators, of which the ‘graph-match’ operator is used to find patterns in a graph. The others (‘make-graph’, ‘graph-to-table’ and ‘graph-merge’) transform tabular data into a graph, transform a graph back into tables, or merge graphs. The graph must fit within the memory of a single node, which can be a serious limitation.
I tried the graph functions on the dummy challenge described below. It uses some of ADX’s system tables, so anyone with sufficient privileges on an ADX cluster can execute it as well. To make this interesting for myself, I executed everything on a database with a large number of functions and tables. However, this database also holds some sensitive data, which is why I did not put any screenshots in this post. Instead, you can try the KQL snippets I used yourself.
Let’s take it for a spin
Since I did not have a challenge at hand suitable for graph analysis, I created a simple one for myself to work on:
Find out which tables are directly and indirectly used by each KQL Function in a database.
Although a dummy exercise, it can still be handy to see which functions may require work when we change a table schema. It seems rather simple, but without the graph functions it becomes tricky, as you must make assumptions about the maximum ‘functions-calling-functions’ depth. To solve this using graph analysis, I created a graph as shown below, which contains all functions and the objects (functions, tables, materialized views, etc.) they access. Using the ‘graph-match’ operator it is possible to create an overview of which tables are accessed by which functions and vice versa. The diagram below provides a conceptual overview: when, for example, the schema of table 3 is changed, we may have to update functions B, C, D and E.
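To make that conceptual picture concrete, here is a minimal, self-contained sketch with invented entity names mirroring the diagram: function B reads table 3 directly, functions C and D call B, and function E calls D. The graph-match then finds every function that directly or indirectly touches table 3:

// Hypothetical nodes and edges mirroring the conceptual diagram
let nodes = datatable(EntityName:string, EntityType:string)
[
    'FunctionB', 'Function',
    'FunctionC', 'Function',
    'FunctionD', 'Function',
    'FunctionE', 'Function',
    'Table3', 'Table'
];
let calls = datatable(Source:string, Target:string)
[
    'FunctionB', 'Table3',    // B reads table 3 directly
    'FunctionC', 'FunctionB', // C calls B
    'FunctionD', 'FunctionB', // D calls B
    'FunctionE', 'FunctionD'  // E calls D
];
calls
| make-graph Source --> Target with nodes on EntityName
// every function with a direct or indirect path to table 3
| graph-match (f)-[path*1..5]->(t)
    where f.EntityType == 'Function' and t.EntityName == 'Table3'
    project AffectedFunction = f.EntityName

This returns functions B, C, D and E: exactly the set that may need attention when the schema of table 3 changes.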
The first step is to create the graph representation, which requires a definition of the edges with an optional definition of the nodes. Within this scenario the nodes are all functions, tables, materialized views and external tables present in the database. For this I created a small table called ‘entities’, as shown below.
.set-or-replace entities <|
.show databases entities
| where DatabaseName == 'mydatabase'
| project DatabaseName, EntityName, EntityType
The edges are all functions and the entities they use. For simple functions which directly access tables this is easy; you won’t even need a join. For functions calling a function which accesses data, a single join would be required. But beyond this it becomes harder, and graphs come in handy. I used the query below to get the edges as a ‘mapping’ from functions to all entities they use, by parsing the function body.
.set-or-replace edges <|
.show functions
// split the body of each function into 'words'
| extend words = split(replace_regex(Body, @'\W+', '<->'), '<->')
// from all words only keep the ones exactly matching an entity
// (the set_intersect proved faster than explode + join)
| extend entity = set_intersect(words, toscalar(entities | summarize make_list(EntityName)))
// explode and project the result to get a simple table: Function -> ChildEntity
| mv-expand entity to typeof(string)
| project Function = Name, ChildEntity = entity
With two tables in place representing the nodes and edges, I could make the graph and query it as shown below. Note that in this text I refer to ‘tables’, but these include external tables and materialized views as well.
edges
// Make a graph from all functions to their child entities, including the node type
// Function, ChildEntity and EntityName refer to the columns holding the node IDs
| make-graph Function --> ChildEntity with entities on EntityName
// find all paths of at most 5 steps from a function node to a data node
| graph-match (function)-[usage*1..5]->(data)
    // only allow endpoints in the match which hold data, i.e. everything not a Function
    where data.EntityType != 'Function'
    // define what the match should return: the function, the table it accesses
    // and the 'call path' via other functions
    project Function = function.EntityName, Path = usage.ChildEntity, Table = data.EntityName, DataType = data.EntityType
| project Function, Path, Steps = array_length(Path), Table, DataType
Within the result, each record holds one relation between a function and a table used within it. The Path and Steps columns contain the actual nodes between the function and the table. In most cases there is one step: the function directly accessing a table. But sometimes the path was longer; in my case at most 3 steps (function -> function -> function -> table).
This result can be summarized further to get an overview of all data used within functions.
| summarize TableCount = count(), MaxDepth = max(Steps), TablesUsed = make_list(Table) by Function
In my case most functions access only one or two tables, with a maximum depth of one. But there is one function using 7 tables with a maximum depth of 3, which means it is potentially a complex one to maintain as it has a lot of dependencies.
In reality, most functions we use do not access the dataset directly in their body, but instead receive the data to work with as parameters. For those functions the method above does not work, as their bodies do not contain references to known tables. In those cases the ADX query log can be used, since it contains the actual values of the parameters. The queries executed, including the functions used with their parameters, can be parsed and used in the same way; a rough sketch follows below.
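A hedged sketch of that direction (I did not build this out for the post): apply the same word-splitting trick to the query log. Note that ‘.show queries’ only returns recent query history, so coverage depends on how far back the log reaches.

// Hypothetical: map recently executed queries to the entities they touch,
// reusing the 'entities' table created earlier
.show queries
| where Database == 'mydatabase'
// split each executed query text into 'words', like the function bodies above
| extend words = split(replace_regex(Text, @'\W+', '<->'), '<->')
// keep only words that exactly match a known entity
| extend entity = set_intersect(words, toscalar(entities | summarize make_list(EntityName)))
| mv-expand entity to typeof(string)
| summarize Queries = count() by Entity = entity

From there, the same edge-building and graph-match steps apply.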
To conclude
Microsoft has added another useful capability to the already impressive set of features supported by Azure Data Explorer. Like most other functions, it provides the basics needed to further understand and analyze the data. I can imagine it is rather limited from a broader graph analytics point of view and will probably not work on large graphs, as the graph must fit within a single node’s memory. Obviously, it does not replace a purpose-built graph database; instead, it simply provides a new analytical capability which I am sure will be surprisingly useful to have.