“Should I use a BI/reporting tool or build data visualizations myself?” is an evergreen question, and the answer is always “it depends.”
I previously wrote about the lack of recursive CTEs in Spark SQL for parent/child hierarchies.
I am a fan of using Spark UDFs and/or map functions for complex business logic, where using the full power of Scala or Java gives better readability or performance than relying on SQL set operations alone.
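As a rough illustration of that pattern (the discount rule and column names below are hypothetical, not taken from any post), a tiered pricing rule reads more naturally as a plain Scala function wrapped in a UDF than as a nested `CASE` expression:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object TieredDiscountUdf {
  // Plain Scala business rule: easier to read and unit test than nested SQL CASE logic.
  def discount(tier: String, amount: Double): Double = tier match {
    case "GOLD"   if amount > 1000 => amount * 0.10
    case "SILVER" if amount > 1000 => amount * 0.05
    case _                         => 0.0
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders = Seq(("GOLD", 1500.0), ("SILVER", 1200.0), ("BRONZE", 900.0))
      .toDF("tier", "amount")

    // Wrap the Scala function as a UDF and apply it column-wise.
    val discountUdf = udf((tier: String, amount: Double) => discount(tier, amount))
    orders.withColumn("discount", discountUdf($"tier", $"amount")).show()

    spark.stop()
  }
}
```

The same rule could also be applied with `map` on a typed Dataset if you prefer to keep the logic entirely in Scala objects rather than columns.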
Terraform provides an azurerm_databricks_workspace resource to create an Azure Databricks workspace.
Power BI provides REST APIs to generate embed tokens, which you can use to authenticate end users to Power BI even if they are not authenticated to your Active Directory tenant. This is known as “embed for your customers”: your application uses a service principal to call Power BI to generate an embed token, and the end user can then use that embed token to view embedded Power BI reports.
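To make the flow concrete, here is a minimal sketch of the token-generation call, assuming you have already acquired an Azure AD access token for the service principal (for example via MSAL); the workspace and report IDs are placeholders:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object EmbedTokenSketch {
  // Calls the Power BI GenerateToken endpoint for a single report.
  // aadToken: an Azure AD access token acquired for the service principal (assumed to exist).
  // groupId / reportId: the workspace and report GUIDs (placeholders here).
  def generateEmbedToken(aadToken: String, groupId: String, reportId: String): String = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(
        s"https://api.powerbi.com/v1.0/myorg/groups/$groupId/reports/$reportId/GenerateToken"))
      .header("Authorization", s"Bearer $aadToken")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString("""{"accessLevel": "View"}"""))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // The JSON response contains the embed token and its expiration; parse as needed.
    response.body()
  }
}
```

Your backend would hand the resulting embed token to the browser, where the Power BI JavaScript client uses it to render the report for the unauthenticated end user.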
I took a look at Azure Synapse and compared it against Azure Databricks.
Power BI provides a comprehensive set of REST APIs for automating various processes with Power BI.
One of the nice things about Spark SQL is that you can reference datasets as if they were statically typed collections of Scala case classes. However, Spark datasets do not natively store case class instances; Spark has its own internal format for representing rows in datasets. Conversion happens on demand in something called an encoder. When you write code against a Dataset of case classes, the encoder translates between Spark's internal row format and your case class instances as needed.
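A small sketch of what that looks like in practice (the `Customer` case class is purely illustrative):

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical case class used only for illustration.
case class Customer(id: Long, name: String)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-example").master("local[*]").getOrCreate()
    import spark.implicits._ // provides the implicit Encoder[Customer]

    // Rows are stored in Spark's internal binary row format, not as Customer
    // instances; the encoder converts between the two on demand.
    val customers: Dataset[Customer] =
      Seq(Customer(1L, "Ada"), Customer(2L, "Grace")).toDS()

    // map needs real Customer objects to run arbitrary Scala code, so the
    // encoder deserializes here and re-serializes the results.
    customers.map(c => c.copy(name = c.name.toUpperCase)).show()

    // The schema the encoder derives from the case class.
    println(Encoders.product[Customer].schema.treeString)

    spark.stop()
  }
}
```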
I ran into a number of numeric discrepancies migrating an ETL process from Microsoft SQL Server to Apache Spark. Some of the same principles may apply to any relational database.
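One well-known example of this class of discrepancy (not necessarily one from that migration) is integer division: T-SQL's `/` truncates when both operands are integers, while Spark SQL's `/` always returns a fractional result and reserves `div` for integer division.

```scala
import org.apache.spark.sql.SparkSession

object DivisionDiscrepancy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("division-example").master("local[*]").getOrCreate()

    // SQL Server: SELECT 5 / 2 performs integer division and returns 2.
    // Spark SQL:  / always returns a fractional result; div does integer division.
    spark.sql("SELECT 5 / 2 AS slash_division").show()  // 2.5
    spark.sql("SELECT 5 div 2 AS int_division").show()  // 2

    spark.stop()
  }
}
```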
Microsoft provides the
azcopy tool for copying data between Azure storage accounts and AWS S3. If you’re otherwise serverless or
fully containerized, and don’t already have an EC2 instance up, it makes sense to run
azcopy in a Fargate task.
I had a query like this
I built a Spark process to extract a SQL Server database to Parquet files on S3, using an EMR cluster. I am using as much parallelism as possible, extracting multiple tables at a time and splitting each table into partitions that are extracted in parallel. My goal is to size the EMR cluster and the total number of parallel threads to the point where I saturate the SQL Server.
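A sketch of what a single table's partitioned read can look like; the connection details, table name, partitioning column, and bounds below are placeholders, not values from the actual process:

```scala
import org.apache.spark.sql.SparkSession

object ExtractOrdersTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mssql-extract").getOrCreate()

    // partitionColumn / lowerBound / upperBound / numPartitions make Spark issue
    // numPartitions parallel range queries against SQL Server for this table.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://sqlserver.example.com;databaseName=Sales")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "dbo.Orders")
      .option("user", sys.env("MSSQL_USER"))
      .option("password", sys.env("MSSQL_PASSWORD"))
      .option("partitionColumn", "OrderId")
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()

    orders.write.mode("overwrite").parquet("s3://example-bucket/extract/orders/")

    spark.stop()
  }
}
```

With `numPartitions` set, Spark opens that many concurrent JDBC connections for the table, so the number of tables extracted at once times the partitions per table is what ultimately determines the load placed on the SQL Server.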
There are two popular ways of packaging code for reuse (besides copy-paste):