Stateful Structured Streaming jobs fail after making changes to stateful operations

Avoid changes to stateful operations between restarts.

Written by brock.baurer

Last published at: November 17th, 2024

Problem

After making changes to a stateful operation in a Structured Streaming query and restarting the query from the same checkpoint, the query fails.

Example

The exact error message depends on your state store provider. The following example demonstrates a change to dropDuplicates() on a streaming DataFrame with a watermark, using the RocksDB state store provider.

# Original code:
streaming_df = streaming_df.dropDuplicates(subset=["column1"])

# Later changed to add a second column to the deduplication key:
streaming_df = streaming_df.dropDuplicates(subset=["column1", "column2"])

org.apache.spark.sql.execution.streaming.state.StateStoreKeySchemaNotCompatible: [STATE_STORE_KEY_SCHEMA_NOT_COMPATIBLE] Provided key schema does not match existing state key schema.
Please check number and type of fields.
Existing key_schema=StructType(StructField(column1,IntegerType,true)) and new key_schema=StructType(StructField(column1,IntegerType,true),StructField(column2,TimestampType,true)).

Cause

Stateful operations in streaming queries require maintaining state data to continuously update results. 

Structured Streaming automatically checkpoints this state data to fault-tolerant storage, such as HDFS, Amazon S3, or Azure Blob Storage, and restores it when the query restarts. However, this recovery assumes the state data schema remains unchanged across restarts. Changing the columns passed to dropDuplicates() changes the key schema of the state store, so the checkpointed state can no longer be restored.

For more information on recovery semantics after changes in a streaming query, refer to the Apache Spark Structured Streaming Programming Guide.

Solution

Only make changes to stateful operations if you are willing to start the query from a new checkpoint location. Using a new checkpoint discards the existing state, and the query rebuilds it from scratch.
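For example, after changing the dropDuplicates() key, restart the query with a new checkpoint location so the state store is rebuilt rather than restored. A minimal sketch, assuming a streaming DataFrame named streaming_df; the sink format, table name, and paths are hypothetical, and the old checkpoint directory is simply no longer referenced:

```python
# Sketch only: paths and table name are hypothetical.
# Pointing the query at a NEW checkpoint location rebuilds state from
# scratch instead of restoring the incompatible old state.
streaming_df = streaming_df.dropDuplicates(subset=["column1", "column2"])

query = (
    streaming_df.writeStream
    .format("delta")
    .outputMode("append")
    # New checkpoint path; do not reuse the old one after a schema change.
    .option("checkpointLocation", "/checkpoints/dedup_query_v2")
    .toTable("target_table")
)
```

Note that starting from a new checkpoint also resets source progress tracking, so plan for reprocessing or configure the source's starting position accordingly.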