Comparing PostgreSQL json_agg and Spark collect_list
In PostgreSQL, you can convert child records to look like a nested collection of objects on the parent record. This is useful if you want to convert a relational-style parent-child model into a document style, with the child records represented as a composite within the parent document.
JSON functions allow conversion of result set rows to and from JSON. The
json_agg function aggregates a list of records into a JSON array of objects. An example of what this would look like:
select parent.id, parent.name, json_agg(child.*) as nested_child from parent join child on parent.id = child.parent_id group by parent.id, parent.name
The resulting JSON column could be stored in a
You can do the same thing in Spark SQL as well with the
select parent.id, parent.name, collect_list(struct(child.*)) as nested_child from parent join child on parent.id = child.parent_id group by parent.id, parent.name
A key difference between Spark arrays/structs and PostgreSQL JSON: Spark SQL is a two-step process. First
struct converts a list of fields into a single
struct object on each child record. Then
collect_list aggregates the structs from the child records into an array per parent record in the
group by. PostgreSQL
json_agg is a single step.
The similarity is that the resulting dataframe can be stored with the nested array of structs in Parquet or Avro format.