
SQL gets a boost: Introducing Pipe syntax in Databricks

Among the many exciting updates introduced in Apache Spark 4.0, one feature in particular stood out to me: the new Pipe SQL syntax. As someone who has worked extensively with Kusto Query Language (KQL), I immediately recognized the power and elegance of this approach. The ability to express SQL queries in a clean, top-down, composable way is a game-changer for productivity and readability. In this post, I want to share my experience with Pipe SQL, why it resonated with me, and how it can transform the way you query and process data.

TL;DR

The new Pipe SQL syntax in Databricks introduces a more intuitive, readable, and modular way to write SQL queries. By chaining operations with the |> operator, developers can express logic in a top-down, sequential flow, similar to DataFrame APIs. This approach eliminates the need for deeply nested subqueries, reduces redundancy, and improves maintainability. It also opens the door to extensibility with new operators like:

  • EXTEND the schema by adding one or more columns (similar to PySpark’s withColumn function)
  • SET the value of one or more columns (also covered by PySpark’s withColumn function)
  • DROP columns, as a shorthand for the SELECT * EXCEPT(…) statement
  • AS table to store an intermediate result for use later in the query, for example to combine later results with earlier ones (this has more uses than you might expect)
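
A minimal sketch of what these operators look like in practice, assuming the TPC-H-style orders table used in the example further down (the derived column names are made up for illustration):

FROM orders
|> EXTEND o_totalprice * 0.21 AS vat_amount           -- EXTEND: add a derived column
|> SET o_comment = upper(o_comment)                    -- SET: overwrite an existing column
|> DROP o_clerk, o_shippriority                        -- DROP: shorthand for SELECT * EXCEPT(...)
|> AS orders_enriched                                  -- AS: name this intermediate result
|> WHERE vat_amount > 1000;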

Drawing parallels with KQL used in various Azure products, Pipe SQL brings a modern, composable structure to traditional SQL without sacrificing compatibility or performance.

The limitations of SQL vs the power of DataFrames

SQL has been the backbone of data querying for over 50 years, but it comes with some long-standing pain points also described in Google’s 2024 paper, “SQL Has Problems. We Can Fix Them: Pipe Syntax in SQL”.

  • Rigid clause order: You must write SELECT … FROM … WHERE … GROUP BY …, even though that is not how the data flows. Starting with the data and working your way towards the final result is more natural.
  • Nested subqueries: Multi-step logic often requires deeply nested queries, making code hard to read and debug. Fortunately, many SQL dialects have adopted Common Table Expressions (CTEs) to ease this pain, but the Pipe syntax helps here as well.
  • Poor composability: You can’t easily chain operations like you can in DataFrame APIs. I sometimes struggle when switching between Spark and SQL because of this totally different way of thinking about data.

In contrast, DataFrames (as in Pandas or PySpark) let you build queries incrementally with fewer restrictions, for example: df.filter(…).groupBy(…).agg(…).withColumn(…).drop(…).orderBy(…).select(…)

This model is intuitive, readable, and modular, something SQL has historically lacked. Google’s paper demonstrates how pipe-structured data flow can modernize SQL without abandoning its foundations.

Enter Pipe SQL: A more natural way to write queries

The new pipe syntax brings this composability to SQL. Instead of writing queries inside-out, you can now write them top-down, using the |> operator to chain operations. You start with an input dataset and build your way towards the intended result. Because the structure is so flexible, you can keep adding statements to the end, as the examples below show. For debugging it is often necessary to inspect intermediate results. With standard SQL this can be hard, given its structure and the dependencies between clauses like SELECT and GROUP BY. With the new syntax, the last few statements can simply be commented out to see the result up to that point, and then added back (or uncommented) incrementally to fix upstream mistakes.
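
To make the comparison with DataFrames concrete, the chain from the previous section maps roughly onto pipe operators as sketched below (the sales table and its columns are hypothetical, chosen only to illustrate the mapping):

FROM sales
|> WHERE region = 'EU'                                          -- df.filter(...)
|> AGGREGATE SUM(amount) AS total GROUP BY product, category    -- df.groupBy(...).agg(...)
|> EXTEND total * 0.21 AS vat                                   -- df.withColumn(...)
|> DROP category                                                -- df.drop(...)
|> ORDER BY total DESC                                          -- df.orderBy(...)
|> SELECT product, total, vat;                                  -- df.select(...)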

Example: Traditional SQL vs Pipe SQL

Standard SQL:

SELECT c_count, COUNT(*) AS custdist
FROM (
  SELECT c_custkey, COUNT(o_orderkey) AS c_count
  FROM customer
  LEFT JOIN orders ON c_custkey = o_custkey
  WHERE o_comment NOT LIKE '%unusual%packages%'
  GROUP BY c_custkey
) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;

Pipe SQL:

FROM customer
|> LEFT JOIN orders ON c_custkey = o_custkey
|> WHERE o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) AS c_count GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;

Note the sequence of two aggregations: standard SQL needs a nested subquery for this, while Pipe SQL expresses it as two consecutive AGGREGATE steps.
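
This example also shows how convenient the debugging workflow described earlier becomes: comment out the last steps to inspect the intermediate per-customer counts, then re-enable them one by one.

FROM customer
|> LEFT JOIN orders ON c_custkey = o_custkey
|> WHERE o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) AS c_count GROUP BY c_custkey
-- |> AGGREGATE COUNT(*) AS custdist GROUP BY c_count
-- |> ORDER BY custdist DESC, c_count DESC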

With a traditional SQL join you can decide where to limit the columns coming from the right-hand table: either in a subquery inside the JOIN or in the top-level SELECT (using the join alias). With Pipe SQL only the first option is available, which can be confusing if you are used to pruning columns via aliases in the top-level SELECT. A join then looks like this:

|> LEFT JOIN (
     FROM orders
     |> SELECT o_custkey, col1, col2, ...
   ) ON c_custkey = o_custkey

This isn’t new: KQL has been around for years

Even though Google’s paper gave a major push to this evolution of SQL, such a ‘piped’ query language has already existed for many years. People working in the Azure stack are very likely to have used it already, as it is the query language of Log Analytics, Azure Data Explorer, Microsoft Sentinel and Fabric Eventhouse. This language is called Kusto Query Language (KQL) and it even uses a pipe symbol as its statement separator. So Google ‘borrowed’ this from Microsoft (who probably just ‘borrowed’ it from someone else…). For those not familiar with KQL, it looks like this: you start with a table (StormEvents) and ‘pipe’ statements onto it.

StormEvents
| summarize StormCount = count() by State, bin(StartTime, 1h)
| order by StormCount desc

Having used KQL and Azure Data Explorer extensively in my work, I can attest to its power and flexibility. The ability to chain operations in a readable, top-down manner has made complex queries much easier to manage and debug. I’ve found that this approach not only improves productivity but also makes the code more maintainable and understandable for others. Seeing Databricks adopt a similar syntax for SQL is exciting, as it brings these benefits to a broader audience.

Final thoughts

The new pipe syntax in Databricks is a game-changer for SQL developers. It brings the clarity of DataFrames, the composability of functional programming, and the familiarity of SQL into one powerful syntax. It is easy to get started with the Pipe syntax, and it doesn’t force you to abandon the SQL ecosystem you already know, as the two flavors can be combined within a single query. Trust me, it really is worth learning, even if you currently feel like ‘why the F*&% should I learn yet another language, trusted old SQL is good enough for me’…
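
For example, a regular SQL subquery can feed a pipe query, so you can adopt the new syntax gradually. A small sketch, again assuming the TPC-H customer table:

FROM (
  SELECT c_custkey, c_nationkey, c_acctbal
  FROM customer
  WHERE c_acctbal > 0          -- standard SQL inside the subquery
) AS active_customers
|> AGGREGATE COUNT(*) AS positive_customers GROUP BY c_nationkey
|> ORDER BY positive_customers DESC;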

The increased flexibility does increase the risk of writing expensive queries that are hard(er) for the engine to optimize. So when using this syntax, always be on the lookout for improvements, most importantly moving the statements that decrease the data size (WHERE, SELECT, DROP) towards the beginning as much as possible. For development purposes I find it handy to downsize the data using the TABLESAMPLE clause and simply remove it when testing for performance.
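
For instance (a sketch; the sampling percentage is arbitrary):

FROM customer TABLESAMPLE (1 PERCENT)   -- development only: remove for the full-scale performance test
|> LEFT JOIN orders ON c_custkey = o_custkey
|> AGGREGATE COUNT(o_orderkey) AS c_count GROUP BY c_custkey;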

As a side note, the Kusto Detective Agency is a great way to learn KQL and familiarize yourself with such a query language. It involves solving increasingly complex data challenges using the KQL language and some of its advanced features, such as graph analysis and geo-spatial functions. If you love working with data, you will love this!

Corné Versloot

Data & AI Consultant