Learning Scala for Spark, or, what's up with that triple equals?

I began to learn Scala specifically to work with Spark. The sheer number of language features in Scala can be overwhelming, so I find it useful to learn Scala features one by one, in the context of specific use cases. In a sense, I’m treating Scala like a DSL for writing Spark jobs.

Author's profile picture Bill Schneider

Getting Spark on Windows to connect to AWS EMR cluster

I managed to get Spark to run on Windows in local mode, and to submit jobs to an EMR cluster in AWS.

What does agile development have in common with musical theater?

Working in an agile development environment, I noticed some parallels to my experiences with student theater several decades ago.

Using MicroStrategy reports as filters for exists and not exists subqueries

Suppose you want to write a report in MSTR to calculate some measure, filtering on patients who had a certain diagnosis and are not on a certain medication. Both Diagnosis and Medication have many-to-many relationships with Patient, and patient count is a metric defined at or below the patient level.

Reporting across different attribute roles in MSTR

Suppose you have a dimensional model about travel prices like this:

Measuring AWS Redshift Query Compile Latency

AWS is transparent that Redshift’s distributed architecture entails a fixed cost every time a new query is issued. The documentation says the impact “might be especially noticeable when you run one-off (ad hoc) queries.”

Pragmatic Functional Programming

I am not passionate about functional programming for its own sake. I am passionate about readability and clean code, and functional programming is a tool to help get there. I take a pragmatic approach: functional programming should be a tool in your toolbox, and you should be ready to use it when it makes your life easier. At the same time, I don’t feel a rush to Pure Function and Immutable All The Things. I like languages that support functional programming but don’t strictly require it.

Opinions on truthiness across languages

Different languages have different opinions about what to treat as “truthy” or “falsy” when using a non-boolean object as an expression inside an if statement.

Comparing PostgreSQL json_agg and Spark collect_list

In PostgreSQL, you can convert child records to look like a nested collection of objects on the parent record. This is useful if you want to convert a relational-style parent-child model into a document style, with the child records represented as a composite within the parent document.

Don't Redux All The Things

Redux is a library and a pattern for managing state in front-end applications. It is typically associated with React but it can be used with other frameworks as well. Instead of directly modifying state, components dispatch actions, which are then handled by reducers. Reducers take current (immutable) state plus the action to produce a new state. The new state is then wired into React component properties, which triggers re-rendering. This interaction is shown in the diagram below:
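The dispatch-and-reduce cycle described above can also be sketched in a few lines of plain TypeScript. This is a minimal illustration that does not use the Redux library itself: the counter state, the action names, and the `createTinyStore` helper are all hypothetical, invented here only to show the pattern of a reducer producing a new state from the current state plus an action.

```typescript
// Hypothetical state and actions for a counter; not taken from the original post.
type CounterState = Readonly<{ count: number }>;

type CounterAction =
  | { type: "increment"; by: number }
  | { type: "reset" };

// A reducer is a pure function: (current state, action) -> new state.
// It never mutates the state it receives.
function counterReducer(state: CounterState, action: CounterAction): CounterState {
  switch (action.type) {
    case "increment":
      return { ...state, count: state.count + action.by };
    case "reset":
      return { ...state, count: 0 };
    default:
      return state;
  }
}

// A tiny stand-in for a store (an assumption for illustration, not Redux's API):
// it holds the current state, runs the reducer on dispatch, and notifies subscribers.
function createTinyStore(initial: CounterState) {
  let state = initial;
  const listeners: Array<(s: CounterState) => void> = [];
  return {
    getState: () => state,
    subscribe: (fn: (s: CounterState) => void) => {
      listeners.push(fn);
    },
    dispatch: (action: CounterAction) => {
      state = counterReducer(state, action);
      listeners.forEach((fn) => fn(state));
    },
  };
}

const initialState: CounterState = { count: 0 };
const store = createTinyStore(initialState);

// A subscriber plays the role of the view layer: it is called with each new state.
const seen: number[] = [];
store.subscribe((s) => seen.push(s.count));

store.dispatch({ type: "increment", by: 2 });
store.dispatch({ type: "increment", by: 3 });
store.dispatch({ type: "reset" });
```

In a real React application the subscriber would be the wiring that passes the new state into component properties; here it only records the counts so the flow is easy to follow.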

Balancing early and later project risks

One of the things I liked about this post on “Senior Engineers Reduce Risk” is how it called out two different kinds of project risks:

Multiple teams, one codebase

When multiple (small, agile) teams are working on the same codebase, it can be tempting to create a branch for each team so they can work in isolation without impacting each other. Don’t do it. Teams working in isolated branches may appear to make faster progress, but it is an illusion: in reality, work in an isolated branch can’t be delivered without getting through some big scary merge in the future. The hard work of integration is just being kicked down the road, and it gets harder the longer it waits. There is no substitute for coordination and communication.

SSIS data flow vs. insert-select

To transform data within a single SQL Server, with source and target data in the same database, it is probably faster to use an INSERT statement than an SSIS data flow task.

Throwaway code isn't always wasteful

In a lean/agile environment, you avoid Big Design Up Front (BDUF) in favor of getting something small and simple to work first, then iterating as you learn more.

MicroStrategy Visual Insights live connection issues

MicroStrategy’s data import feature is promising, because it can shorten the time to insight. Instead of architecting a project schema up front (attributes, facts, metrics), you can import tables directly into a dataset (Intelligent Cube).
