Monday, June 30, 2025

Introducing the DataFrame API for Table-Valued Functions

Table-Valued Functions (TVFs) have long been a powerful tool for processing structured data. They allow functions to return multiple rows and columns instead of just a single value. Previously, using TVFs in Apache Spark required SQL, making them less flexible for users who prefer the DataFrame API.

We’re pleased to announce the new DataFrame API for Table-Valued Functions. Users can now invoke TVFs directly within DataFrame operations, making transformations simpler, more composable, and fully integrated with Spark’s DataFrame workflow. This is available in Databricks Runtime (DBR) 16.1 and above.

In this blog, we’ll explore what TVFs are and how to use them, both with scalar and table arguments. Consider the three benefits of using TVFs:

Key Benefits

  • Native DataFrame Integration: Call TVFs directly through the spark.tvf namespace, with no SQL required.
  • Chainable and Composable: Combine TVFs effortlessly with your favorite DataFrame transformations, such as .filter(), .select(), and more.
  • Lateral Join Support (available in DBR 17.0): Use TVFs in joins to dynamically generate and expand rows based on each input row’s data.

Using the Table-Valued Function DataFrame API

We’ll start with a simple example using a built-in TVF. Spark comes with useful TVFs like variant_explode, which expands JSON structures into multiple rows.

Here is the SQL approach, as a minimal sketch with an assumed sample array:
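
    # variant_explode expands a VARIANT array into one row per element,
    # returning pos, key, and value columns. The array below is an
    # assumed example value.
    spark.sql("""
        SELECT pos, value
        FROM variant_explode(parse_json('["10", "20", "30"]'))
    """).show()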

And here is the equivalent DataFrame API approach, sketched with the same sample array:
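
    from pyspark.sql.functions import lit, parse_json

    # The spark.tvf namespace (Spark 4.0) exposes built-in TVFs as
    # DataFrame-returning methods; no SQL string needed.
    spark.tvf.variant_explode(parse_json(lit('["10", "20", "30"]'))).show()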

As you can see above, it’s straightforward to use TVFs either way: via SQL or the DataFrame API. Both produce the same result, using scalar arguments.

Accepting Table Arguments

What if you want to use a table as an input argument? This is useful when you want to operate on entire rows of data. Let’s look at an example where we want to compute the duration and cost of travel by car and by air.

Let’s imagine a simple DataFrame (the city pairs and distances below are assumed sample values):
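
    # Assumed sample data: city pairs and the distance between them in miles.
    df = spark.createDataFrame(
        [("Los Angeles", "New York", 2790),
         ("San Francisco", "Seattle", 810)],
        ["from_city", "to_city", "distance_miles"],
    )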

We need our class to handle a table row as an argument. Note that the eval method takes a Row argument from a table instead of a scalar argument. Here is a sketch, in which the travel speeds and fare rate are illustrative assumptions:
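
    from pyspark.sql import Row
    from pyspark.sql.functions import udtf

    @udtf(returnType="from_city: string, to_city: string, "
                     "car_hours: double, air_hours: double, airfare: double")
    class TravelEstimates:
        def eval(self, row: Row):
            # eval receives a whole Row from the table argument.
            distance = row["distance_miles"]
            yield (
                row["from_city"],
                row["to_city"],
                distance / 60.0,    # assumed average car speed: 60 mph
                distance / 500.0,   # assumed average flight speed: 500 mph
                distance * 0.12,    # assumed airfare: $0.12 per mile
            )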

With this definition handling a Row from a table, we can compute the desired result by passing our DataFrame as a table argument:
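
    # asTable() (Spark 4.0) passes the whole DataFrame as a TABLE argument.
    TravelEstimates(df.asTable()).show()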

Or you can create a table, register the UDTF, and use it in a SQL statement as follows:
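
    # Register the UDTF, expose the DataFrame as a view, then call it
    # from SQL with the TABLE(...) argument syntax.
    spark.udtf.register("travel_estimates", TravelEstimates)
    df.createOrReplaceTempView("routes")

    spark.sql("SELECT * FROM travel_estimates(TABLE(routes))").show()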

Alternatively, you can achieve the same result by calling the TVF with a lateral join, which is useful with scalar arguments (read on for an example).

Taking it to the Next Level: Lateral Joins

You can also use lateral joins to call a TVF with an entire DataFrame, row by row. Both lateral join and table argument support are available in DBR 17.0.

A lateral join lets you call a TVF over each row of a DataFrame, dynamically expanding the data based on the values in that row. Let’s explore a few examples with more than a single row.

Lateral Join with Built-in TVFs

Say we have a DataFrame where each row contains an array of numbers. As before, we can use variant_explode to explode each array into individual rows. Let’s set one up, with assumed sample values:
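
    # Assumed sample data: each row holds a JSON-encoded array of numbers.
    json_df = spark.createDataFrame(
        [(1, '[1, 2, 3]'), (2, '[4, 5]')],
        ["id", "json"],
    )
    json_df.createOrReplaceTempView("json_arrays")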

Here is the SQL approach, sketched against the sample view above:
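
    # LATERAL lets the TVF reference columns of the preceding FROM item,
    # so each row's array is exploded into one output row per element.
    spark.sql("""
        SELECT id, value
        FROM json_arrays,
        LATERAL variant_explode(parse_json(json))
    """).show()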

And here is the equivalent DataFrame approach:
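
    from pyspark.sql.functions import col, parse_json

    # lateralJoin (Spark 4.0 / DBR 17.0) runs the TVF once per input row;
    # .outer() marks "json" as a lateral reference to the left-hand side.
    json_df.lateralJoin(
        spark.tvf.variant_explode(parse_json(col("json").outer()))
    ).select("id", "value").show()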

Lateral Join with Python UDTFs

Sometimes, the built-in TVFs just aren’t enough. You may need custom logic to transform your data in a specific way. That’s where User-Defined Table Functions (UDTFs) come to the rescue! Python UDTFs let you write your own TVFs in Python, giving you full control over the row expansion process.

Here’s a simple Python UDTF that generates a sequence of numbers from a starting value to an ending value, and returns both the number and its square:
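
    from pyspark.sql.functions import udtf

    @udtf(returnType="num: int, squared: int")
    class SquareNumbers:
        def eval(self, start: int, end: int):
            # Emit every number in [start, end] together with its square.
            for num in range(start, end + 1):
                yield num, num * num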

Now, let’s use this UDTF in a lateral join. Imagine we have a DataFrame with start and end columns, and we want to generate the number sequences for each row (the ranges below are assumed sample values):
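
    from pyspark.sql.functions import col

    # Each row expands into its own number sequence; .outer() marks the
    # start and end columns as lateral references.
    ranges_df = spark.createDataFrame([(1, 3), (5, 7)], ["start", "end"])

    ranges_df.lateralJoin(
        SquareNumbers(col("start").outer(), col("end").outer())
    ).show()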

Here is another illustrative example of how to use a UDTF with a lateralJoin [see documentation], using a DataFrame of cities and the distances between them. We want to expand it into a new table with additional information, such as the travel time between them by car and by air, along with the added cost of airfare.

Let’s use our airline distances DataFrame from above:
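
    # Reusing the city-pair DataFrame defined earlier.
    df.show()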

We can modify our earlier Python UDTF that computes the duration and cost of travel between two cities so that its eval method accepts scalar arguments. Here is a sketch, keeping the same assumed speeds and fare rate:
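
    from pyspark.sql.functions import udtf

    @udtf(returnType="car_hours: double, air_hours: double, airfare: double")
    class TravelEstimatesScalar:
        def eval(self, distance_miles: int):
            # Same travel-estimate logic, but from a scalar distance.
            yield (
                distance_miles / 60.0,    # assumed car speed: 60 mph
                distance_miles / 500.0,   # assumed flight speed: 500 mph
                distance_miles * 0.12,    # assumed airfare: $0.12 per mile
            )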

Finally, let’s call our UDTF with lateralJoin, giving us the desired output. Unlike our earlier airline example, this UDTF’s eval method accepts scalar arguments:
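
    from pyspark.sql.functions import col

    # Each row's distance is passed to the UDTF as a scalar lateral reference.
    df.lateralJoin(
        TravelEstimatesScalar(col("distance_miles").outer())
    ).show()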

Conclusion

The DataFrame API for Table-Valued Functions provides a more cohesive and intuitive approach to data transformation within Spark. We demonstrated three ways to use TVFs: SQL, the DataFrame API, and Python UDTFs. By combining TVFs with the DataFrame API, you can process multiple rows of data and perform bulk transformations.

Additionally, by passing table arguments or using lateral joins with Python UDTFs, you can implement custom business logic for specific data processing needs. We showed two examples of transforming and augmenting data to produce the desired output, using both scalar and table arguments.

We encourage you to explore the capabilities of this new API to streamline your data transformations and workflows. This functionality is available in the Apache Spark™ 4.0.0 release. If you are a Databricks customer, you can use it in DBR 16.1 and above.
