Unit testing Spark Dataframe transformations

Parts of Spark jobs that do pure-functional transformations on dataframes or RDDs – independent of I/O – are ideal candidates for unit testing.

I demonstrated this in a trivial project here. This project has two main classes: one that takes a dataframe and generates a new dataframe with an additional column; and another that tests that transformation. This uses another library, which provides a trait (SharedSparkContext) for setting up a default local-testing SparkContext. Spark will then run in local mode on a workstation without any Hadoop ecosystem dependencies.

The main code for creating a dataframe programmatically looks like this:

val sqlContext = new SQLContext(sc)
    val rows = Seq[Row](
        Row(1),
        Row(2),
        Row(3)
        ).asJava
    val schema = StructType(Seq(StructField("a", IntegerType)))
   
    val df = sqlContext.createDataFrame(rows, schema)

(There is probably a better way to construct the input row list.)

The test can be run through sbt or through Eclipse with Scala IDE.

To get up and running from a clean Eclipse environment (assuming you already have Eclipse and a JDK installed):

  • install Scala
  • install sbt
  • install Scala-IDE for Eclipse (should include scalatest support)
  • The project already includes a plugins.sbt file with dependency on sbteclipse plugin, so that should get picked up when you run “sbt eclipse” for the first time.
  • “sbt test” will run tests, or you can run them through Eclipse with “Run As…” (Ctrl-F11)