Skip to the content.

Weird Hive and Spark SQL discrepancy with varchar truncation

I found an edge case where Hive SQL and Spark SQL will produce different results on a basic SELECT col FROM table query.

Let’s say you have DDL on an underlying Parquet file with a VARCHAR column.

-- Hive DDL
CREATE TABLE truncate_demo (
   col VARCHAR(10)
) STORED AS 'PARQUET' LOCATION ....

Now, let’s suppose the Parquet file has a value that overflows the specified length in the above DDL. For example, the value "foo bar baz" is 11 characters.

If you run a query select col, length(col) from truncate_demo in Hive directly through Beeline, it will show a length of 10 and the string truncated to 10 characters.

If you run the same query through Spark SQL it will show the full length of the string (11) and display the string without truncation.

I found this discrepancy unexpected.

Note that the same does not apply to flat files/CSV with STORED AS 'TEXTFILE'; in that case, both Spark and Hive will truncate equally.

Written on April 9, 2020