Spark UDFs Revisited
A while back I wrote about some of the caveats of using UDFs.
Beyond the concerns about null handling and performance, there are a few additional points to keep in mind:
- Not everyone is familiar with Spark array functions like `filter` and `transform`. Some UDFs can be replaced with these native Spark SQL functions.
- There is now less of a penalty for using UDFs in Python, thanks to vectorized UDFs built on pandas. Previously, Python UDFs performed worse than UDFs in Scala or Java because of the overhead of serializing data between the JVM and the Python interpreter.
- On the other hand, if you're using Databricks, UDFs will prevent a given Spark SQL job from using Photon acceleration.
When you want to use UDFs anyway
There are still times when you should use UDFs. My current rule of thumb: use UDFs if and only if there isn't a straightforward solution in Spark SQL (or DataFrame operations), after taking full account of what you can do with window functions and array operators like `filter` and `transform`.
Some examples:
- Running batch inference on a dataframe, perhaps with the MLflow Spark API.
- Business logic that operates on a group of records at a time, where you might need to make multiple passes over the records in a group, or where processing one record requires tracking state about how the other records in the group have been processed.
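For the batch-inference case, a minimal sketch using the MLflow Spark API might look like the following. The model URI and table name are assumptions for illustration; this won't run without a registered model:

```python
# Sketch only: assumes a model logged to MLflow at the given (hypothetical)
# URI, and a table of feature columns matching the model's input schema.
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mlflow.pyfunc.spark_udf wraps a logged model as a Spark UDF
predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/my_model/1")

features_df = spark.table("features")  # hypothetical source table
scored = features_df.withColumn("prediction", predict(*features_df.columns))
```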
Some examples of complex business logic that justifies a UDF:
- Pairing up rows in a transaction ledger where positive and negative amounts cancel each other out. Doing this in Python, Scala, or Java on a list of objects is straightforward. (I could imagine it as a Leetcode-style interview question.) Doing it in SQL would be tricky, because you have to keep track of whether a given record was already paired with another, to prevent using it again.
- Clinical quality measures, where you have different measures with different criteria that translate to combinations of `exists`/`not exists` logic. This is unwieldy to do in SQL alone and might end up with more reads of the source data than you'd like. The code becomes simpler and more readable if instead you use a UDF that takes a `PatientInfo` object containing a full record for the patient (lists of encounters, diagnosis codes, procedure codes, etc.) and can iterate over those collections multiple times. (The `PatientInfo` object would be analogous to a FHIR bundle.)
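The ledger-pairing logic can be sketched as a plain Python function, the kind of thing you would wrap in a UDF (the function name and pairing rule are illustrative assumptions):

```python
# Sketch: pair up indices whose amounts cancel (x and -x), using each row
# at most once. The dict tracks which rows are still available to pair.
def pair_cancelling_amounts(amounts):
    unmatched = {}  # amount -> list of indices not yet paired
    pairs = []
    for i, amt in enumerate(amounts):
        if unmatched.get(-amt):
            # A row with the opposite amount is available: pair with it
            # and remove it from the pool, so it can't be reused.
            j = unmatched[-amt].pop()
            pairs.append((j, i))
        else:
            unmatched.setdefault(amt, []).append(i)
    return pairs
```

In Spark, you could apply a function like this per group, e.g. via `groupBy(...).applyInPandas(...)`, which hands your code all of a group's rows at once. The state tracking that makes this awkward in SQL is just a dict here.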
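For the clinical-measures case, the shape of the code might look like this. Everything here is an illustrative assumption (the class fields, the codes, the measure logic), intended only to show how iterating over a patient's full record reads compared to nested `exists` subqueries:

```python
# Sketch: a PatientInfo-style object that a UDF could take as input.
from dataclasses import dataclass, field

@dataclass
class PatientInfo:
    diagnosis_codes: list = field(default_factory=list)
    procedure_codes: list = field(default_factory=list)

def meets_measure(p: PatientInfo) -> bool:
    # "exists a qualifying diagnosis AND NOT exists an excluding procedure" --
    # each clause is one pass over an in-memory list, not a correlated subquery.
    has_qualifying_dx = any(c.startswith("E11") for c in p.diagnosis_codes)
    has_exclusion = any(c == "99999" for c in p.procedure_codes)
    return has_qualifying_dx and not has_exclusion
```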
I would not use UDFs when SQL window functions would have been straightforward. For example, comparing a row to a subtotal within a group of rows is exactly what SQL window functions exist for.