
Spark encoders, implicits and custom encoders

One of the nice things about Spark SQL is that you can reference datasets as if they were statically-typed collections of Scala case classes. However, Spark datasets do not natively store case class instances; Spark has its own internal format for representing rows in datasets. Conversion happens on demand, via something called an encoder. When you write code like this:
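(The post's actual example sits behind the link; the sketch below only illustrates the pattern. The Person case class, the session name, and the local master are assumptions for the illustration, not the post's code.)

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class, used only for illustration.
case class Person(name: String, age: Int)

object EncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("encoder-example")
      .master("local[*]")
      .getOrCreate()

    // Brings implicit Encoder instances for case classes and
    // common primitive types into scope.
    import spark.implicits._

    // toDS() requires an implicit Encoder[Person]; Spark derives one
    // from the case class fields and uses it to translate between
    // Person instances and its internal row representation.
    val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()

    // The typed filter deserializes each internal row back into a
    // Person before applying the predicate.
    people.filter(_.age > 40).show()

    spark.stop()
  }
}
```

The import of spark.implicits._ is what supplies the encoder here; for types Spark cannot derive one for, you have to provide a custom encoder, which, going by the title, is presumably where the post is headed.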

Read More

Run azcopy from AWS Fargate

Microsoft provides the azcopy tool for copying data between Azure storage accounts and AWS S3. If you're otherwise serverless or fully containerized and don't already have an EC2 instance up, it makes sense to run azcopy in a Fargate task.

Read More

Spark mistakes I made

I built a Spark process to extract a SQL Server database to Parquet files on S3, using an EMR cluster. I am using as much parallelism as possible, both extracting multiple tables at a time and splitting each table into partitions that are extracted in parallel. My goal is to size the EMR cluster and the total number of parallel threads to the point where I saturate the SQL Server.
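A minimal sketch of this kind of partitioned JDBC extraction, using Spark's standard JDBC options. The connection URL, table name, partition column, bounds, and environment variable names are placeholders for illustration, not the post's actual configuration:

```scala
import org.apache.spark.sql.SparkSession

object ExtractTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sqlserver-extract")
      .getOrCreate()

    // Placeholder connection details; real values would come from configuration.
    val jdbcUrl = "jdbc:sqlserver://myhost:1433;databaseName=mydb"

    // partitionColumn / lowerBound / upperBound / numPartitions tell Spark
    // to issue numPartitions JDBC queries in parallel, each covering one
    // slice of the partition column's range.
    val df = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "dbo.orders")         // hypothetical table
      .option("user", sys.env("DB_USER"))
      .option("password", sys.env("DB_PASSWORD"))
      .option("partitionColumn", "order_id")   // hypothetical numeric key
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()

    // On EMR, s3:// paths are handled by EMRFS.
    df.write.mode("overwrite").parquet("s3://my-bucket/extracts/orders/")
  }
}
```

Running several such jobs concurrently, one per table, is what multiplies the load on the source server, which is why the cluster size and the total thread count have to be tuned together.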

Read More