I found it helpful to create Spark UDFs to make it easier to migrate logic in SQL from another database like SQL Server.
I began to learn Scala specifically to work with Spark. The sheer number of language features in Scala can be overwhelming, so, I find it useful to learn Scala features one by one, in context of specific use cases. In a sense I’m treating Scala like a DSL for writing Spark jobs.
I managed to get Spark to run on Windows in local mode, and to submit jobs to an EMR cluster in AWS.
Working in an agile development environment, I noticed some parallels to my experiences with student theater several decades ago.
Suppose you want to write a report in MSTR to calculate some measure, filtering on patients who had a certain diagnosis and are not on a certain medication. Both Diagnosis and Medication have many-to-many relationships with Patient, and patient count is a metric defined at or below the patient level.
Suppose you have a dimensional model about travel prices like this
AWS is transparent that Redshift’s distributed architecture entails a fixed cost every time a new query is issued. The documentation says the impact “might be especially noticeable when you run one-off (ad hoc) queries.”
I am not passionate about functional programming for its own sake. I am passionate about readability and clean code, and functional programming is a tool to help get there. I take a pragmatic approach: functional programming should be a tool in your toolbox, and you should be ready to use when it makes your life easier. At the same time, I don’t feel a rush to Pure Function and Immutable All The Things. I like languages that support functional programming but don’t strictly require it.
Different languages have different opinions about what to treat as “truthy” or “falsy” when using a non-boolean object as an expression inside an
In PostgreSQL, you can convert child records to look like a nested collection of objects on the parent record. This is useful if you want to convert a relational-style parent-child model into a document style, with the child records represented as a composite within the parent document.
Redux is a library and a pattern for managing state in front-end applications. It is typically associated with React but it can be used with other frameworks as well. Instead of directly modifying state, components dispatch actions, which are then handled by reducers. Reducers take current (immutable) state plus the action to produce a new state. The new state is then wired into React component properties, which triggers re-rendering. This interaction is shown in the diagram below:
One of the things I liked about this post on “Senior Engineers Reduce Risk” is how it called out two different kinds of project risks:
When multiple (small, agile) teams are working on the same codebase, it can be tempting to create a branch for each team so they can work in isolation without impacting each other. Don’t do it. Teams working in isolated branches may appear to make faster progress, but it is an illusion–in reality, work in an isolated branch can’t be delivered without getting through some big scary merge in the future. The hard work of integration is just being kicked down the road and gets harder the longer it waits. There is no substitute for coordination and communication.
To transform data within a single SQL Server, with source and target data in the same database, it is probably faster to use an
INSERT statement than a SSIS data flow task.
In a lean/agile environment, you avoid Big Design Up Front (BDUF) in favor of getting something small and simple to work first, then iterating as you learn more.