Comparing PostgreSQL json_agg and Spark collect_list

In PostgreSQL, you can convert child records to look like a nested collection of objects on the parent record. This is useful if you want to convert a relational-style parent-child model into a document style, with the child records represented as a composite within the parent document.

JSON functions allow conversion of result set rows to and from JSON. The json_agg function aggregates a list of records into a JSON array of objects. An example of what this would look like:

select parent.id, parent.name, json_agg(child.*) as nested_child
from parent 
join child on parent.id = child.parent_id
group by parent.id, parent.name

The resulting JSON column could be stored in a jsonb column.

You can do the same thing in Spark SQL as well with the struct and collect_list functions:

select parent.id, parent.name, collect_list(struct(child.*)) as nested_child
from parent 
join child on parent.id = child.parent_id
group by parent.id, parent.name

A key difference between Spark arrays/structs and PostgreSQL JSON: Spark SQL is a two-step process. First struct converts a list of fields into a single struct object on each child record. Then collect_list aggregates the structs from the child records into an array per parent record in the group by. PostgreSQL json_agg is a single step.

The similarity is that the resulting dataframe can be stored with the nested array of structs in Parquet or Avro format.

Written on March 30, 2017

Bill Schneider

where I write about software engineering. All opinions are my own

Comparing PostgreSQL json_agg and Spark collect_list