Spark UDFs Revisited
A while back I wrote about some of the caveats of using UDFs.
I did an experiment using Azure OpenAI to map short free-text descriptions of lab tests and their units (“hemoglobin g/dL”) to industry-standard LOINC codes. I was surprised to get a decent result with minimal effort: somewhere in the ballpark of a 70% match against known results, with only a few hours of playing around with the prompt.
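For a sense of the mechanics, here is a hedged sketch in Scala of the kind of call involved: one short description plus units sent to an Azure OpenAI chat deployment, asking for a LOINC code back. The endpoint, deployment name, api-version, and prompt are illustrative placeholders rather than the actual experiment's values, and JSON parsing and error handling are omitted.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch only: ask an Azure OpenAI chat deployment to suggest a LOINC code for
// one free-text lab description. Endpoint, deployment, and api-version are
// placeholders; a real pipeline would also escape the description for JSON.
def suggestLoinc(endpoint: String, apiKey: String, deployment: String, description: String): String = {
  val body =
    s"""{
       |  "messages": [
       |    {"role": "system", "content": "Map the lab test description and units to the single most likely LOINC code. Reply with the code only."},
       |    {"role": "user", "content": "$description"}
       |  ],
       |  "temperature": 0
       |}""".stripMargin

  val request = HttpRequest.newBuilder()
    .uri(URI.create(s"$endpoint/openai/deployments/$deployment/chat/completions?api-version=2024-02-01"))
    .header("api-key", apiKey)
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(body))
    .build()

  // The model's reply comes back under choices[0].message.content in the JSON
  // response; the raw body is returned here for brevity.
  HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())
    .body()
}

// e.g. suggestLoinc("https://my-resource.openai.azure.com", sys.env("AOAI_KEY"), "gpt-4o", "hemoglobin g/dL")
```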
Most cloud-native frameworks for Apache Spark (Databricks, Microsoft Fabric, AWS EMR, etc.) provide a way not only to work with notebooks interactively, but also to automate running those notebooks in production.
“Should I use a BI/reporting tool or build data visualizations myself?” is an evergreen question, and the answer is always “it depends.”
I previously wrote about the lack of recursive CTEs in Spark SQL for parent/child hierarchies.
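The workaround I keep coming back to is iterating self-joins until the hierarchy stops changing. Below is a minimal sketch, assuming a hypothetical (id, parent_id) edge table, that resolves each node to its root ancestor:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Hypothetical parent/child edges: node 1 is the root.
val edges = Seq(
  (1, None),
  (2, Some(1)),
  (3, Some(2)),
  (4, Some(2))
).toDF("id", "parent_id")

// Start each node at its parent (or itself for roots), then repeatedly follow
// the parent link until the result stops changing.
var resolved = edges.withColumn("root_id", coalesce($"parent_id", $"id"))
var converged = false
while (!converged) {
  val next = resolved.as("c")
    .join(edges.as("p"), $"c.root_id" === $"p.id", "left")
    .select(
      $"c.id",
      $"c.parent_id",
      coalesce($"p.parent_id", $"c.root_id").as("root_id"))
  converged = next.except(resolved).isEmpty
  resolved = next // in a real job, cache or checkpoint here to keep the plan from growing
}

resolved.show() // every node now carries the id of its root ancestor
```

Each pass walks one more level up the tree, so the number of joins is bounded by the depth of the hierarchy rather than the number of rows.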
I am a fan of using Spark UDFs and/or map functions for complex business logic, where the full power of Scala or Java gives better readability or performance than relying on SQL set operations alone.
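As a rough illustration of what that looks like, the same hypothetical business rule is written below twice: once as a UDF applied to DataFrame columns, and once as a plain Scala function mapped over a typed Dataset. The Claim case class and the rule itself are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical record type for the example.
case class Claim(claimId: String, amount: Double, payerCode: String)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val claims = Seq(
  Claim("c-1", 120.0, "A"),
  Claim("c-2", 940.0, "B")
).toDS()

// Option 1: wrap the business rule in a UDF and use it from the DataFrame API.
val adjustedAmount = udf { (amount: Double, payerCode: String) =>
  if (payerCode == "A") amount * 0.95 else amount
}
val viaUdf = claims.withColumn("adjusted", adjustedAmount(col("amount"), col("payerCode")))

// Option 2: express the same rule as ordinary Scala over the case class.
val viaMap = claims.map { c =>
  val adjusted = if (c.payerCode == "A") c.amount * 0.95 else c.amount
  (c.claimId, adjusted)
}.toDF("claimId", "adjusted")
```

The typed map keeps the rule as plain, testable Scala, which is usually where the readability win comes from.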
Terraform provides an azurerm_databricks_workspace resource to create an Azure Databricks workspace.
Power BI provides REST APIs to generate embed tokens, which you can use to authenticate end users to Power BI even if they are not authenticated to your Active Directory tenant. This is known as embed for your customers: your application uses a service principal to call Power BI to generate an embed token, and the end user can then use that embed token to execute Power BI reports.
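A minimal sketch of that token call, assuming you already have an Azure AD access token for the service principal and know the workspace (group) and report IDs; this targets the Power BI “Reports - Generate Token In Group” REST endpoint, with JSON parsing and error handling left out.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch only: request a "View" embed token for one report in one workspace.
// aadToken is an Azure AD access token acquired by the service principal.
def generateEmbedToken(aadToken: String, groupId: String, reportId: String): String = {
  val request = HttpRequest.newBuilder()
    .uri(URI.create(
      s"https://api.powerbi.com/v1.0/myorg/groups/$groupId/reports/$reportId/GenerateToken"))
    .header("Authorization", s"Bearer $aadToken")
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString("""{"accessLevel": "View"}"""))
    .build()

  // The response body is JSON containing the embed token; parse it with your
  // JSON library of choice. The raw body is returned here for brevity.
  HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())
    .body()
}
```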
I took a look at Azure Synapse and compared it against Azure Databricks.
Power BI provides a comprehensive set of REST APIs for automating various processes with Power BI.
One of the nice things about Spark SQL is that you can reference datasets as if they were statically-typed collections of Scala case classes. However, Spark datasets do not natively store case class instances; Spark has its own internal format for representing rows in datasets. Conversion happens on demand in something called an encoder. When you write code like this:
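(The snippet is a sketch with a made-up case class and path; the implicit encoder for the case class comes from spark.implicits._.)

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema for the example.
case class LabResult(patientId: String, testCode: String, value: Double)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._ // brings the implicit Encoder[LabResult] into scope

val results = spark.read
  .parquet("/data/lab_results") // illustrative path
  .as[LabResult]

// The rows are still held in Spark's internal format; they are only decoded
// into LabResult instances when a typed operation needs real JVM objects.
val flagged = results.filter(r => r.value > 20.0)
```

You are not handing Spark a collection of LabResult objects: the .as[LabResult] call resolves the encoder, and rows are converted into instances only when a typed operation such as the filter actually needs them.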
I ran into a number of numeric discrepancies migrating an ETL process from Microsoft SQL Server to Apache Spark. Some of the same principles may apply to any relational database.
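One cheap way to see where two engines diverge is to run the same decimal arithmetic on both sides and compare the result types they derive. The probe below is illustrative and only shows the Spark half.

```scala
import org.apache.spark.sql.SparkSession

// The precision and scale Spark derives for a DECIMAL quotient follow Spark's
// own rules (influenced by spark.sql.decimalOperations.allowPrecisionLoss),
// which do not necessarily match what SQL Server derives for the same expression.
val spark = SparkSession.builder.getOrCreate()

spark.sql(
  "SELECT CAST(1 AS DECIMAL(38, 10)) / CAST(3 AS DECIMAL(38, 10)) AS quotient"
).printSchema() // compare the reported decimal(p, s) with SQL Server's result type
```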
Microsoft provides the azcopy tool for copying data between Azure storage accounts and AWS S3. If you’re otherwise serverless or fully containerized, and don’t already have an EC2 instance up, it makes sense to run azcopy in a Fargate task.
I had a query like this