SQL Server, JDBC and compiled query plans

I had a query like this

Spark mistakes I made

I built a Spark process to extract a SQL Server database to Parquet files on S3, using an EMR cluster. I am using as much parallelism as possible, both extracting multiple tables at a time and splitting tables into partitions to be extracted in parallel. My goal is to size the EMR cluster and the total number of parallel threads to the point where I saturate the SQL Server.
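
As a rough illustration of splitting one table into partitions, here is a minimal sketch that turns a numeric key range into non-overlapping WHERE predicates, one per partition. The helper name `partition_predicates`, the column `order_id`, and the bounds are all hypothetical; the excerpt does not say how the partitions were actually chosen.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split the half-open range [lower, upper) into contiguous slices.

    Returns one SQL predicate string per slice. Hypothetical helper:
    assumes `column` is an integer key that is spread fairly evenly.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")

    span = upper - lower
    # Never create more slices than there are distinct key values.
    n = min(num_partitions, span)
    # Integer boundaries, so every key lands in exactly one slice.
    bounds = [lower + (span * i) // n for i in range(n + 1)]
    return [
        f"{column} >= {lo} AND {column} < {hi}"
        for lo, hi in zip(bounds, bounds[1:])
    ]


if __name__ == "__main__":
    for predicate in partition_predicates("order_id", 0, 100, 4):
        print(predicate)
```

A list like this could be handed to the `predicates` argument of PySpark's `DataFrameReader.jdbc`, so that each predicate becomes one partition read over its own connection. Whether that is a good fit depends on how evenly the key is distributed.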

Code reuse - REST API vs. embedded library

There are two popular ways of packaging code for reuse (besides copy-paste):

Another look back at C

I just worked on C code again for the first time in a long while. I’ve worked almost exclusively in some kind of managed runtime environment: Java, C#, Python, or JavaScript in a browser. So working on C again was like going through a bit of a time warp. I spent a while trying to figure out how to connect the old stuff I used to know with the new. Like, are gcc, gdb etc. all still there? And how do I get them to work with VS Code, and on macOS?

My first impression of Go

I recently learned Golang so I could contribute to some Terraform and AWS-related projects.

Maven dependency conflict resolution is annoying

With Spark development, I am frequently running into dependency conflicts because of Maven’s “nearest wins” strategy for resolving transitive dependencies.
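
To make “nearest wins” concrete, here is a simplified model of the rule: walk the dependency tree breadth-first, and the first version seen for an artifact wins, so the shallowest declaration beats deeper ones and declaration order breaks ties. The artifact names and versions are made up, and the sketch ignores exclusions, scopes, and `dependencyManagement`.

```python
from collections import deque


def resolve_nearest_wins(root_deps, transitive):
    """Simplified model of Maven's "nearest wins" mediation.

    root_deps:  list of (artifact, version) declared directly by the project.
    transitive: dict mapping (artifact, version) to the list of
                (artifact, version) pairs that it depends on.
    Returns a dict of artifact -> winning version.
    """
    resolved = {}
    queue = deque(root_deps)
    while queue:
        artifact, version = queue.popleft()
        if artifact in resolved:
            # A nearer (or earlier-declared) version already won;
            # the losing version's own dependencies are not followed.
            continue
        resolved[artifact] = version
        queue.extend(transitive.get((artifact, version), []))
    return resolved


if __name__ == "__main__":
    # Hypothetical graph: util 2.0 sits three levels down, util 1.0 two.
    root = [("lib-a", "1.0"), ("lib-b", "1.0")]
    graph = {
        ("lib-a", "1.0"): [("lib-c", "1.0")],
        ("lib-c", "1.0"): [("util", "2.0")],
        ("lib-b", "1.0"): [("util", "1.0")],
    }
    print(resolve_nearest_wins(root, graph))
```

In this made-up graph the older `util` 1.0 wins because it is closer to the root, even though 2.0 is newer. That is the kind of surprise that shows up as a missing method at runtime.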

SQL Server tempdb on EC2 instance storage, on Linux

Normally I’m a big fan of using managed AWS services like RDS, Redshift and Aurora, so you don’t have to be in the business of managing your own database. Still, there are some edge cases where you need finer-grained control over storage, and running a DB like SQL Server on EC2 makes sense. AWS makes a SQL Server AMI for Linux available on the marketplace.

Weird Hive and Spark SQL discrepancy with varchar truncation

I found an edge case where Hive SQL and Spark SQL will produce different results on a basic SELECT col FROM table query.

See Redshift queries behind cursor fetch

By default, the Redshift ODBC/JDBC drivers will fetch all result rows from a query. If your result sets are large, you may have ended up using the UseDeclareFetch and Fetch parameters. But if you do this, you won’t see your actual queries in the STL_QUERY table or Redshift console. Instead you will see that the actual long-running query looks like

MSTR Web development over SOCKS proxy and AWS SSM

The problem: You want to work on MicroStrategy (MSTR) Web customizations in your local Eclipse/Tomcat environment, but you don’t have connectivity to your I-server. Your I-server lives in AWS, your corporate network blocks outbound port 22 access, and you don’t have a VPN or Direct Connect.

Serverless Flask with Terraform

I followed Andrew Griffith’s blog post to deploy a serverless Flask app, with Lambda and API Gateway, managing the AWS resources with Terraform.

New AWS SSM feature to tunnel SSH with port forwarding support

AWS SSM already had a “session manager” feature that allowed users to get command prompts through a web browser. The big advantage this had over providing an SSH bastion host is that SSM is covered by the same governance context as other AWS services: authentication and authorization via IAM, with audit via CloudTrail.

Timestamp and timezone confusion with Spark, Parquet and Redshift

Most of the time developers don’t give much thought to timezones. In a data-center-based application, ETL, DB, and applications all run in the same timezone, so there are limited opportunities for discrepancies.

Asynchronous callbacks with AWS Step Functions and Lambda

AWS Step Functions recently added support for callback patterns for long-running tasks.
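
For a sense of what the callback pattern looks like, here is a minimal sketch of a state machine definition built as a Python dict. The task state invokes a Lambda function with the `.waitForTaskToken` suffix and passes the task token in the payload; the execution then pauses until something reports back with that token. The function name `start-long-job`, the state names, and the one-hour timeout are placeholders, not taken from the post.

```python
import json


def callback_task_state(function_name, next_state, timeout_seconds=3600):
    """Build a Task state that pauses until the task token is returned.

    Hypothetical helper: function_name and next_state are placeholders.
    """
    return {
        "Type": "Task",
        # The .waitForTaskToken suffix selects the callback pattern.
        "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
        "Parameters": {
            "FunctionName": function_name,
            "Payload": {
                # Forward the state input plus the token from the context object.
                "input.$": "$",
                "taskToken.$": "$$.Task.Token",
            },
        },
        # Fail the state if nobody calls back within the timeout.
        "TimeoutSeconds": timeout_seconds,
        "Next": next_state,
    }


definition = {
    "StartAt": "StartLongJob",
    "States": {
        "StartLongJob": callback_task_state("start-long-job", "Done"),
        "Done": {"Type": "Succeed"},
    },
}

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

Whatever finishes the long-running work then resumes the execution by calling the Step Functions `SendTaskSuccess` or `SendTaskFailure` API with the same token, for example through boto3's `send_task_success(taskToken=..., output=...)`.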

How I used middle-school arithmetic to solve a Redshift migration issue

One of the most annoying issues I encountered while migrating from Oracle to Redshift could be solved with middle-school arithmetic.

Bill Schneider

where I write about software engineering. All opinions are my own.
