Learning Scala for Spark, or, what's up with that triple equals?
I began to learn Scala specifically to work with Spark. The sheer number of language features in Scala can be overwhelming, so I find it useful to learn them one by one, in the context of specific use cases. In a sense I’m treating Scala like a DSL for writing Spark jobs.
Let’s pick apart a simple fragment of Spark-Scala code: `dataFrame.filter($"age" === 21)`.
There are a few things going on here:
- The `$"age"` syntax creates a Spark `Column` object referencing the column named `age` within a DataFrame. The `$` operator is defined in an implicit class, `StringToColumn`. Implicit classes are similar in concept to C# extension methods or to mixins in dynamic languages: the `$` operator is effectively a method added onto the `StringContext` class. A simplified sketch of this mechanism appears after this list.
- The triple equals operator `===` is conventionally used in Scala libraries as a type-safe equality operator, analogous to the one in JavaScript. Spark instead defines `===` as a method on `Column` that creates a new `Column` representing the comparison between the column on the left and the value on the right; the comparison only yields a boolean when the query actually runs. Because double-equals (`==`) cannot be overridden (it is final on `Any` and always returns a `Boolean`), Spark must use the triple equals. A sketch of this pattern also appears after this list.
- The `dataFrame.filter` method takes a `Column` argument, which defines the comparison to apply to the rows in the `DataFrame`. Only rows that match the condition will be included in the resulting `DataFrame`. A complete example putting all of these pieces together appears after this list.
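To make the first point concrete, here is a minimal, self-contained sketch of the implicit-class mechanism. This is not Spark's actual source; `MyColumn` and `StringToMyColumn` are made-up stand-ins for Spark's `Column` and `StringToColumn`, just to show how an implicit class bolts a `$` interpolator onto `StringContext`:

```scala
// Stand-in for Spark's Column, so the example runs without Spark on the classpath.
case class MyColumn(name: String)

object ColumnSyntax {
  // The compiler desugars $"age" into StringContext("age").$(), finds no $
  // method on StringContext, and then applies this implicit class to supply one.
  implicit class StringToMyColumn(val sc: StringContext) extends AnyVal {
    def $(args: Any*): MyColumn = MyColumn(sc.s(args: _*))
  }
}

object ImplicitClassDemo extends App {
  import ColumnSyntax._
  println($"age")   // prints: MyColumn(age)
}
```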
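For the second point, here is a rough sketch, again with made-up types (`ExprColumn` and a toy expression tree loosely modeled on Spark's internal Catalyst expressions), of why a separate `===` works where `==` cannot: `===` is just an ordinary method, so it is free to return an expression object instead of a `Boolean`.

```scala
// A tiny expression tree standing in for Spark's internal query expressions.
sealed trait Expr
case class ColRef(name: String)             extends Expr
case class Lit(value: Any)                  extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

case class ExprColumn(expr: Expr) {
  // === builds a *description* of the comparison; nothing is evaluated here.
  def ===(other: Any): ExprColumn = ExprColumn(EqualTo(expr, Lit(other)))
}

object TripleEqualsDemo extends App {
  val age = ExprColumn(ColRef("age"))

  println(age === 21)   // ExprColumn(EqualTo(ColRef(age),Lit(21))) -- an expression
  println(age == 21)    // false -- the built-in ==, which can only return a Boolean
}
```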
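Finally, here is the original fragment in a complete, minimal Spark program; the sample data, column names, and app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object FilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilterExample")
      .master("local[*]")
      .getOrCreate()

    // Brings $"..." (the StringToColumn implicit class) and toDF into scope.
    import spark.implicits._

    val dataFrame = Seq(("Alice", 21), ("Bob", 34)).toDF("name", "age")

    // filter takes a Column describing the condition; only matching rows remain.
    val twentyOnes = dataFrame.filter($"age" === 21)
    twentyOnes.show()   // only the ("Alice", 21) row

    spark.stop()
  }
}
```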
Note that the actual comparison is not performed when the above line of code executes! Spark methods like `filter` and `select` (including the `Column` objects passed in) are lazy. You can think of a `DataFrame` like a query builder pattern, where each call builds up a plan for what Spark will do later when an action like `show` or `write` is invoked. It’s similar in concept to something like `IQueryable` in LINQ, where `foo.Where(row => row.Age == 21)` builds up an expression tree that is only translated to SQL when rows must be fetched, e.g., when `ToList()` is called.
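As a sketch of that laziness, continuing the Spark example above (assuming `spark`, `spark.implicits._`, and `dataFrame` from that sketch are in scope), the following builds up a plan without touching any rows until an action is called:

```scala
// Each transformation only adds to the query plan; nothing runs yet.
val plan = dataFrame.filter($"age" === 21).select($"name")

// explain() prints the plan Spark has accumulated so far -- still no rows read.
plan.explain()

// Only an action such as show(), collect(), or a write triggers execution.
plan.show()
```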