Record updates in Azure Data Explorer are finally here

Record updates in Azure Data Explorer are finally here

Article Data Platform Solutions Data & AI

Those familiar with Azure Data Explorer (ADX) or following my blog series on this technology, which you can find here, will know it is not possible to perform updates. There are some ways to get it done via Materialized Views or record deletes + appends in a database script, but those methods have some disadvantages. In my opinion the inability to easily perform record updates has been a major disadvantage of ADX and plenty questions on this topic can be found online. In a recent project we ended up fully reloading 20 tables in slowly changing dimension type 2 (SCD2) every day… You can imagine my surprise when I found the ‘update table command’ in the February 2024 release notes and immediately took it for a spin.

This article contains my understanding and lessons learned regarding the ADX’s new update capability. I created a dummy setup in which staged data is merged into a SCD-2 dimensional table just like I could have used last year. All code I used will be within the article and everything was executed on a free ADX cluster. I will not cover the update syntax itself as that is thoroughly documented already.

The theory on updating records in ADX

The update statement has two versions referred to as simplified and expanded syntax. But under the hood they both work based on a dataset which defines what must be deleted and another dataset which will be appended. The simplified syntax simply uses the same dataset to define deletions as well as appends. The update statement works by atomically deleting records and appending records to a table. Very similar to the execution of:

.execute database script <| 
.delete table target_table records <| RECORDS TO DELETE; 
.set-or-append target_table <| RECORDS TO APPEND; 

So, it really is not updating records one would expect from a SQL perspective in which records matching a predicate are updated ‘in place’. Given ADX’s columnar append-only storage architecture it is not a surprise but still something to be aware of when ‘updating’ data. The way it works has some elements to be aware of:

  • The queries / data defining the records to delete and to append must have the full table schema. It is not possible to update a single column, instead all columns must be provided.
  • Since records are not updated in place but instead ingested they get a new ingestion_time which must be considered in relation with retention policies.
  • The update can change the number of rows in the table. This can happen when more records are deleted than appended or vice-versa. This would never happen with an SQL UPDATE.

Lets take a look at this example from the documentation:

.update table People on Name <| 
  datatable(Name:string, Address:string)[ 
  "Alice", "2 Macquarie Street", 
  "Diana", "350 Fifth Avenue" ] 
  | where true

Effectively, this will delete all the records where Name is ‘Alice’ or ‘Diana’. There can be zero but also hundreds of records matching that criteria and all will be deleted. Only the two records specified will be appended. So you know for sure that after the append there will be exactly two records matching those names.

The predicate ‘where true’ is used throughout the documentation but I found that it is not (always) needed and honestly I do not understand the purpose of it. Maybe this will change over time.

Concrete SCD-2 example

Multiple use cases for data updates come to mind but maintaining dimension tables is the most important one for me. There are multiple ways to model such dimension but a popular one used a lot is ‘type 2’ which tracks all changes over time. Technically this is mostly done by adding ‘start_at’ and ‘end_at’ datetime columns which define the time range in which a record was ‘active’. The end_at of an ‘active’ record is typically NULL or high like 2099-12-31. A concrete SCD2 example for machines with some configuration is shown below. Such a table allows for time travel to find the configuration for each machine at any point in time.

ID config start_at end_at
A 1 2024-01-01 2024-01-23
A 2 2024-01-23 2024-03-18
A 3 2024-03-18 NULL
B 1 2023-12-24 2024-02-14
B 2 2024-02-14 NULL

Every time a new record for an existing ID arrives the old record must be updated by setting the end_at and the new record must be appended. Traditionally this could only be done using a materialized view or database script, both could be quite complex. But now we can use the new update statement so that is what I did.

The simple example below generates some dummy staging data and merges it into a dimension table using SCD2. Conceptually, it works by getting all new records from the staging table and union these with existing records for the same ID’s from the target dimension table. Having all existing and old records for the same ID’s together we can deduplicate and align them based on time, set their end_at, and ‘update’ them in the target dim table. I was forced to use a temporary table as the update statement itself does not allow a union. Using this method the data can arrive out of order, include duplicates and still be aligned properly in the dimension table.

// create some dummy data to be merged with the target dimension
.set-or-replace stg_simulation <|
  range i from 1 to 1000000 step 1
  | project id=i
  | extend type = tostring(dynamic(["Customer", "Employee", "Boss"])[id %3]),
    color = tostring(dynamic(["Red", "Blue", "Gray", "Orange", "Purple", "Yellow"])[id % 6]),
    guid = new_guid(),
    creation_date = datetime(2000-01-01)
.execute database script with ( ContinueOnErrors = false) <|
    .drop table tmp_simulation ifexists // drop temp table if it exists
    ;
    .set-or-append dim_simulation <| // create the target dimension table if it does not yet exist
        stg_simulation
        | project id, type, color, guid, start_at = creation_date, end_at = datetime(null)
        | take 0
    .alter table dim_simulation policy roworder (id asc, start_at asc) // optionally set some policies for the table
    ;
    .set-or-replace tmp_simulation <| // create temp table with all records to append to the dim table (combination of new and existing records)
        dim_simulation
        | lookup kind=inner (stg_simulation | project id) on id // get all records from dim table for ID's which have updates
        | union ( // add new records to the set
            stg_simulation
            | extend start_at = creation_date
            | project-away creation_date
            )
        | partition hint.strategy=shuffle by id ( // put all records per ID in the right order and set the end_at field
            summarize take_any(*) by start_at // remove duplicates per ID as these may exist on repeated loading
            | order by start_at asc
            | extend new_end_at = next(start_at) // can also check for explicit delete flag from source and set the end_date accordingly
        )
        | where end_at != new_end_at or (isnull(end_at) and isnull(new_end_at)) // only keep changed records
        | project id, type, color, guid, start_at, end_at = new_end_at; // make sure the output matches the target schema
    .update table dim_simulation delete delete_these append append_these <|
        let delete_these = dim_simulation | join kind=inner tmp_simulation on id, start_at // simply remove all records in the update set based on ID and start_at
        ;
        let append_these = tmp_simulation | where true // simply use the temp table we created above
        ;
    .drop table tmp_simulation ifexists // drop temp table at the end for cleanup

I executed this script several times on a free ADX cluster to do an initial load followed by updates. To get a feel for the performance, I ran the following steps and logged their runtimes:

  • Load first 1M into empty target dim took 24 sec (simply appending initial data)
  • Merge 100K into the existing 1M took 4.6 sec resulting in 1.1M records
  • Merge same 100K into 1.1M took: 4.7 sec (no effect as duplicates are removed!)
  • Merge 1M into 1.1M took 35 sec resulting in 2.1M records
  • Merge 1M record in between the existing 2.1M took: 42 sec (out of order data)
  • Merge same 1M records into 3.1M took 43 sec (no effect as duplicates are removed)
  • Merge 10K records into 3.1M took 2.1 sec

I tried the same tests using 10M+ records but failed as the free cluster lacked memory for it. Merging 10M records with 10M records having the same ID pushes 20M records in memory for deduplication and ordering and proved too much for the free cluster.

Conclusion

Overall the update statement is an useful addition to KQL, especially when an orchestration solution like Data Factory is present. It allows for a more flexible way to perform batch loading and updating. For example, to maintain dimensional data or mimic a materialized-view (for whatever reason…). At the same time it has quite some ‘quirks’ to consider like:

  • Only allowing extend, join, order, project, take and where statements which oftentimes are not enough. Although not ideal it is possible to use a temp table to work around this limitation as described in this article
  • A maximum of 5 million records can be deleted in a single update execution. This severely impacts is use as plenty of scenarios would go over this limit or have the risk to do so which must be taken into account. It should be possible to split large updates into smaller pieces, for example using percentiles or modulo methods on the id. But this would make the entire update process non-atomic and would generally be ill advised.
  • Not updating data at all but instead deleting and appending data which might impact data retention. Heavy use of updates might create a lot of tombstones which can put additional burden on data housekeeping (i.e. merging extents)
  • Requiring a separate solution to execute update statements. Often we have a Data Factory alongside ADX to do this but it would be more powerful if updates can be part of update policies. This would allow for ‘streaming merge’ patterns like Databrick’s apply changes. However, I do not expect this to happen soon as it would require handling parallel updates and is something ADX generally does not support

Stay up to date!

Are you enjoying this content? Sign up for our (Dutch) Newsletter to get highlighted insights written by our experts.


Corné Versloot
By Corné Versloot
Data & AI Consultant

15 Apr 2024
Knowledge Hub overview