Power BI provides REST APIs to generate embed tokens, which you can use to authenticate end users to Power BI even if they are not authenticated to your Active Directory tenant. This is known as embed for your customers: your application uses a service principal to call Power BI to generate an embed token, and the end user can then use that embed token to view Power BI reports embedded in your application.
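As a minimal sketch of that flow, the helper below assembles the pieces of a call to Power BI's `GenerateToken` REST endpoint. The workspace and report IDs are placeholders, and the Azure AD access token is assumed to have already been obtained by your service principal via the client-credentials flow.

```python
import json

POWER_BI_API = "https://api.powerbi.com/v1.0/myorg"

def build_generate_token_request(group_id, report_id, aad_access_token, access_level="View"):
    """Build the URL, headers, and JSON body for Power BI's GenerateToken call.

    aad_access_token is the token your service principal obtained from
    Azure AD; group_id is the workspace ID. All IDs here are placeholders.
    """
    url = f"{POWER_BI_API}/groups/{group_id}/reports/{report_id}/GenerateToken"
    headers = {
        "Authorization": f"Bearer {aad_access_token}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"accessLevel": access_level})
    return url, headers, body
```

POST these with any HTTP client; the `token` field of the JSON response is the embed token your front end hands to the Power BI client library.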
I took a look at Azure Synapse and compared it against Azure Databricks.
Power BI provides a comprehensive set of REST APIs for automating various processes with Power BI.
One of the nice things about Spark SQL is that you can reference datasets as if they were statically-typed collections
of Scala case classes. However, Spark datasets do not natively store case class instances; Spark has its own internal format
for representing rows in datasets. Conversion happens on demand in something called an
encoder. When you write code
I ran into a number of numeric discrepancies migrating an ETL process from Microsoft SQL Server to Apache Spark. Some of the same principles may apply to any relational database.
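One general class of discrepancy (an illustration, not a case taken from the post) comes from columns that are approximate floating point in one engine and exact decimal in the other: the same aggregation then produces slightly different totals. A minimal Python demonstration:

```python
from decimal import Decimal

# The same values summed as binary floating point vs exact decimal:
# engines defaulting to different numeric types will disagree.
values = ["0.1"] * 10

float_total = sum(float(v) for v in values)      # approximate binary float
decimal_total = sum(Decimal(v) for v in values)  # exact decimal arithmetic

print(float_total)    # 0.9999999999999999
print(decimal_total)  # 1.0
```

Pinning both sides of a migration to explicit, matching decimal types avoids this whole family of mismatches.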
Microsoft provides the
azcopy tool for copying data between Azure storage accounts and AWS S3. If you’re otherwise serverless or
fully containerized, and don’t already have an EC2 instance up, it makes sense to run
azcopy in a Fargate task.
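A sketch of what the Fargate task actually runs: the helper below assembles an azcopy invocation for an S3-to-Blob copy. azcopy reads AWS credentials from the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables, and the Azure side is authorized by a SAS token on the destination URL. The bucket, storage account, and container names are placeholders.

```python
def azcopy_s3_to_blob_cmd(bucket, account, container, sas_token):
    """Assemble the azcopy command for copying an S3 bucket to Azure Blob storage.

    azcopy picks up AWS credentials from the environment; the destination is
    authorized by the SAS token appended to the URL. Names are placeholders.
    """
    src = f"https://s3.amazonaws.com/{bucket}"
    dst = f"https://{account}.blob.core.windows.net/{container}?{sas_token}"
    return ["azcopy", "copy", src, dst, "--recursive"]
```

In the container you would execute this with `subprocess.run`, or bake the equivalent command into the task definition's command field.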
I had a query like this
I built a Spark process to extract a SQL Server database to Parquet files on S3, using an EMR cluster. I am using as much parallelism as possible, extracting multiple tables at a time and splitting tables up into partitions to be extracted in parallel. My goal is to size the EMR cluster and the total number of parallel threads to the point where I saturate the SQL Server.
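The per-table splitting works along the lines of Spark's JDBC reader, which carves a numeric partition column into `numPartitions` contiguous ranges between `lowerBound` and `upperBound`, one WHERE clause per parallel reader. A sketch of that bounds calculation (not the post's exact code):

```python
def partition_bounds(lower, upper, num_partitions):
    """Split the inclusive range [lower, upper] into contiguous sub-ranges,
    one per parallel extraction thread, covering every value exactly once.
    The last partition absorbs any remainder."""
    stride = (upper - lower + 1) // num_partitions
    bounds = []
    start = lower
    for i in range(num_partitions):
        end = upper if i == num_partitions - 1 else start + stride - 1
        bounds.append((start, end))
        start = end + 1
    return bounds
```

Each `(start, end)` pair becomes a predicate like `id BETWEEN start AND end`, so total thread count is tables-in-flight times partitions per table.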
There are two popular ways of packaging code for reuse (besides copy-paste):
I recently learned Golang so I could contribute to some Terraform and AWS-related projects.
With Spark development, I am frequently running into dependency conflicts because of Maven’s “nearest wins” strategy for resolving transitive dependencies.
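"Nearest wins" means the version declared closest to the root of the dependency tree is kept, with ties at the same depth going to the earlier declaration. A breadth-first walk reproduces this behavior; the simulation below (my own sketch, not Maven code) shows why a shallow direct dependency overrides a deeper transitive one:

```python
from collections import deque

def resolve_nearest_wins(root_deps, graph):
    """Simulate Maven's 'nearest wins' version mediation with a BFS.

    graph maps (artifact, version) -> list of (artifact, version) dependencies.
    The first version of each artifact encountered (shallowest, then earliest
    declared) wins; deeper declarations of the same artifact are ignored.
    """
    chosen = {}
    queue = deque(root_deps)
    while queue:
        artifact, version = queue.popleft()
        if artifact in chosen:
            continue  # a nearer declaration already won
        chosen[artifact] = version
        queue.extend(graph.get((artifact, version), []))
    return chosen
```

So if your application declares one Guava version directly while a Spark artifact pulls in an older one transitively, your direct (depth-1) declaration wins, regardless of which version is newer. That is exactly how conflicts sneak in: "nearest", not "newest".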
Normally I’m a big fan of using managed AWS services like RDS, Redshift and Aurora, so you don’t have to be in the business of managing your own database. Still, there are some edge cases where you need finer-grained control over storage, and running a DB like SQL Server on EC2 makes sense. AWS makes a SQL Server AMI for Linux available on the AWS Marketplace.
I found an edge case where Hive SQL and Spark SQL will produce different results on a basic
SELECT col FROM table query.