“Should I use a BI/reporting tool or build data visualizations myself?” is an evergreen question, and the answer is always “it depends.”
I previously wrote about the lack of recursive CTEs in Spark SQL for parent/child hierarchies.
I am a fan of using Spark UDFs and/or map functions for complex business logic, where using the full power of Scala or Java gives better readability or performance than relying on SQL set operations alone.
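As a rough illustration of that pattern (the discount rule and column names below are hypothetical, not taken from any post), a tiered pricing rule reads more naturally as a plain Scala function wrapped in a UDF than as a nested `CASE` expression:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object TieredDiscountUdf {
  // Plain Scala business rule: easier to read and unit test than nested SQL CASE logic.
  def discount(tier: String, amount: Double): Double = tier match {
    case "GOLD"   if amount > 1000 => amount * 0.10
    case "SILVER" if amount > 1000 => amount * 0.05
    case _                         => 0.0
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders = Seq(("GOLD", 1500.0), ("SILVER", 1200.0), ("BRONZE", 900.0))
      .toDF("tier", "amount")

    // Wrap the Scala function as a UDF and apply it column-wise.
    val discountUdf = udf((tier: String, amount: Double) => discount(tier, amount))
    orders.withColumn("discount", discountUdf($"tier", $"amount")).show()

    spark.stop()
  }
}
```

The same rule could also be applied with `map` on a typed Dataset if you prefer to keep the logic entirely in Scala objects rather than columns.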
Terraform provides an azurerm_databricks_workspace resource to create an Azure Databricks workspace.
Power BI provides REST APIs to generate embed tokens, which you can use to authenticate end users to Power BI even if they are not authenticated to your Active Directory tenant. This is known as “embed for your customers”: your application uses a service principal to call Power BI to generate an embed token, and the end user can then use that embed token to view embedded Power BI reports.
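To make the flow concrete, here is a minimal sketch of the token-generation call, assuming you have already acquired an Azure AD access token for the service principal (for example via MSAL); the workspace and report IDs are placeholders:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object EmbedTokenSketch {
  // Calls the Power BI GenerateToken endpoint for a single report.
  // aadToken: an Azure AD access token acquired for the service principal (assumed to exist).
  // groupId / reportId: the workspace and report GUIDs (placeholders here).
  def generateEmbedToken(aadToken: String, groupId: String, reportId: String): String = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(
        s"https://api.powerbi.com/v1.0/myorg/groups/$groupId/reports/$reportId/GenerateToken"))
      .header("Authorization", s"Bearer $aadToken")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString("""{"accessLevel": "View"}"""))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // The JSON response contains the embed token and its expiration; parse as needed.
    response.body()
  }
}
```

Your backend would hand the resulting embed token to the browser, where the Power BI JavaScript client uses it to render the report for the unauthenticated end user.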
I took a look at Azure Synapse and compared it against Azure Databricks.
Power BI provides a comprehensive set of REST APIs for automating various processes with Power BI.
One of the nice things about Spark SQL is that you can reference datasets as if they were statically typed collections of Scala case classes. However, Spark datasets do not natively store case class instances; Spark has its own internal format for representing rows in datasets. Conversion happens on demand in something called an encoder. When you write code against a Dataset of case classes, the encoder translates between Spark's internal row format and your case class instances as needed.
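A small sketch of what that looks like in practice (the `Customer` case class is purely illustrative):

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical case class used only for illustration.
case class Customer(id: Long, name: String)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-example").master("local[*]").getOrCreate()
    import spark.implicits._ // provides the implicit Encoder[Customer]

    // Rows are stored in Spark's internal binary row format, not as Customer
    // instances; the encoder converts between the two on demand.
    val customers: Dataset[Customer] =
      Seq(Customer(1L, "Ada"), Customer(2L, "Grace")).toDS()

    // map needs real Customer objects to run arbitrary Scala code, so the
    // encoder deserializes here and re-serializes the results.
    customers.map(c => c.copy(name = c.name.toUpperCase)).show()

    // The schema the encoder derives from the case class.
    println(Encoders.product[Customer].schema.treeString)

    spark.stop()
  }
}
```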
I ran into a number of numeric discrepancies migrating an ETL process from Microsoft SQL Server to Apache Spark. Some of the same principles may apply to any relational database.
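One well-known example of this class of discrepancy (not necessarily one from that migration) is integer division: T-SQL's `/` truncates when both operands are integers, while Spark SQL's `/` always returns a fractional result and reserves `div` for integer division.

```scala
import org.apache.spark.sql.SparkSession

object DivisionDiscrepancy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("division-example").master("local[*]").getOrCreate()

    // SQL Server: SELECT 5 / 2 performs integer division and returns 2.
    // Spark SQL:  / always returns a fractional result; div does integer division.
    spark.sql("SELECT 5 / 2 AS slash_division").show()  // 2.5
    spark.sql("SELECT 5 div 2 AS int_division").show()  // 2

    spark.stop()
  }
}
```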
Microsoft provides the
azcopy tool for copying data between Azure storage accounts and AWS S3. If you’re otherwise serverless or
fully containerized, and don’t already have an EC2 instance up, it makes sense to run
azcopy in a Fargate task.
I had a query like this
I built a Spark process to extract a SQL Server database to Parquet files on S3, using an EMR cluster. I am using as much parallelism as possible, extracting multiple tables at a time and splitting each table into partitions that are extracted in parallel. My goal is to size the EMR cluster and the total number of parallel threads to the point where I saturate the SQL Server.
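A sketch of what a single table's partitioned read can look like; the connection details, table name, partitioning column, and bounds below are placeholders, not values from the actual process:

```scala
import org.apache.spark.sql.SparkSession

object ExtractOrdersTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mssql-extract").getOrCreate()

    // partitionColumn / lowerBound / upperBound / numPartitions make Spark issue
    // numPartitions parallel range queries against SQL Server for this table.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://sqlserver.example.com;databaseName=Sales")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "dbo.Orders")
      .option("user", sys.env("MSSQL_USER"))
      .option("password", sys.env("MSSQL_PASSWORD"))
      .option("partitionColumn", "OrderId")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()

    orders.write.mode("overwrite").parquet("s3://example-bucket/extract/orders/")

    spark.stop()
  }
}
```

With `numPartitions` set, Spark opens that many concurrent JDBC connections for the table, so the number of tables extracted at once times the partitions per table is what ultimately determines the load placed on the SQL Server.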
There are two popular ways of packaging code for reuse (besides copy-paste):