I recently learned Golang so I could contribute to some Terraform and AWS-related projects.
With Spark development, I frequently run into dependency conflicts because of Maven’s “nearest wins” strategy for resolving transitive dependencies.
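When nearest-wins picks the wrong version, one common fix is to pin the version explicitly in a dependencyManagement block, which overrides mediation for the whole build. A minimal sketch (the Jackson coordinates are only an example of a typical Spark-ecosystem conflict, not taken from the post):

```xml
<!-- pom.xml fragment: force a single version of a conflicting
     transitive dependency (coordinates and version are illustrative) -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.6.7</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```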
Normally I’m a big fan of using managed AWS services like RDS, Redshift, and Aurora, so you don’t have to be in the business of managing your own database. Still, there are some edge cases where you need finer-grained control over storage, and running a DB like SQL Server on EC2 makes sense. AWS makes a SQL Server AMI for Linux available on the Marketplace.
I found an edge case where Hive SQL and Spark SQL will produce different results on a basic SELECT col FROM table query.
By default, the Redshift ODBC/JDBC drivers will fetch all result rows from a query. If your result sets are large, you may have ended up using the Fetch parameters. But if you do this, you won’t see your actual queries in the STL_QUERY table or the Redshift console. Instead, you will see that the actual long-running query looks like…
The problem: You want to work on MicroStrategy (MSTR) Web customizations in your local Eclipse/Tomcat environment, but you don’t have connectivity to your I-server. Your I-server lives in AWS, your corporate network blocks outbound port 22 access, and you don’t have a VPN or direct connect.
I followed Andrew Griffith’s blog post to deploy a serverless Flask app, with Lambda and API Gateway, managing the AWS resources with Terraform.
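At its core, that setup is a Lambda function fronted by API Gateway, all declared in Terraform. A minimal sketch of just the Lambda half (names and paths are illustrative, and the IAM role is assumed to be defined elsewhere; the API Gateway wiring is omitted):

```hcl
# Terraform sketch: the Lambda function behind the API
# (function name, artifact path, and handler are placeholders)
resource "aws_lambda_function" "flask_app" {
  function_name = "flask-app"
  filename      = "build/flask_app.zip"
  handler       = "app.handler"
  runtime       = "python3.9"
  role          = aws_iam_role.flask_app.arn  # role defined elsewhere
}
```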
AWS SSM already had a “session manager” feature that allowed users to get command prompts through a web browser. The big advantage this had over providing an SSH bastion host is that SSM is covered by the same governance context as other AWS services: authentication and authorization via IAM, with audit via CloudTrail.
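Because sessions are authorized through IAM, access can be scoped with an ordinary policy rather than SSH key distribution; a minimal sketch (the tag key and value are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ssm:StartSession",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": { "ssm:resourceTag/Team": "data-eng" }
      }
    }
  ]
}
```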
Most of the time developers don’t give much thought to timezones. In a data-center based application, ETL, DB, and applications all run in the same timezone, so there are limited opportunities for discrepancies.
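The discrepancies show up as soon as two components render the same instant in different zones; around a DST boundary they can even disagree on the calendar date, which breaks date-partitioned ETL. A small illustration (the dates are chosen to straddle the 2019 US DST change; standard library only):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# One instant in UTC, rendered in two zones that an ETL job and a DB
# might each assume is "the" local time.
instant = datetime(2019, 3, 10, 7, 30, tzinfo=timezone.utc)
ny = instant.astimezone(ZoneInfo("America/New_York"))
la = instant.astimezone(ZoneInfo("America/Los_Angeles"))

print(ny.isoformat())  # 2019-03-10T03:30:00-04:00 (DST has already started)
print(la.isoformat())  # 2019-03-09T23:30:00-08:00 -- a different calendar day
```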
AWS Step Functions recently added support for callback patterns for long-running tasks.
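In the callback pattern, the state machine pauses at a `.waitForTaskToken` task and hands your worker a task token; the worker finishes the long-running job and reports back with SendTaskSuccess. A minimal Python sketch (the `orderId` payload shape is hypothetical; the boto3 call is the real API but only runs against live AWS credentials):

```python
import json

def build_callback_output(order_id, status):
    """Build the JSON output a worker reports back to Step Functions."""
    return json.dumps({"orderId": order_id, "status": status})

def complete_task(task_token, order_id):
    """Resume a paused .waitForTaskToken task (requires AWS credentials)."""
    import boto3  # imported here so the module loads without boto3 installed
    sfn = boto3.client("stepfunctions")
    sfn.send_task_success(
        taskToken=task_token,
        output=build_callback_output(order_id, "COMPLETE"),
    )
```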
One of the most annoying issues I encountered while migrating from Oracle to Redshift could be solved with middle-school arithmetic.
Oracle’s DATE type stores both a date and a time, down to the second.
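That mismatch means a DATE mapped to a date-only target column silently drops the time part, and downstream comparisons against the original value can disagree by almost a full day. A pure-Python illustration (the values are made up):

```python
from datetime import datetime

# An Oracle DATE value carries time down to the second.
oracle_date = datetime(2019, 6, 1, 23, 59, 58)

# Loading it into a date-only target column truncates the time part.
as_date_only = oracle_date.date()

print(oracle_date)   # 2019-06-01 23:59:58
print(as_date_only)  # 2019-06-01

# A comparison against the original timestamp can now be off by
# up to 86,399 seconds.
delta = oracle_date - datetime.combine(as_date_only, datetime.min.time())
print(delta.total_seconds())  # 86398.0
```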
Sometimes it can be useful to introspect MSTR report definitions to make assertions about them for unit tests, or to extract other metadata about their structure.
Spark can access files in S3, even when running in local mode, given AWS credentials. By default, with s3a URLs, Spark will search for credentials in a few different places:
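One of those places is explicit Hadoop configuration passed through Spark; a minimal sketch (property values are placeholders):

```
# spark-defaults.conf sketch: supplying s3a credentials explicitly
# (key values below are placeholders, not real credentials)
spark.hadoop.fs.s3a.access.key   <your-access-key-id>
spark.hadoop.fs.s3a.secret.key   <your-secret-access-key>
```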
I’ve worked with both old-school ETL tools (Informatica, SSIS), and more recently worked with Spark. My takeaway is that AWS Glue is a mash-up of both concepts in a single tool.