Problem
After changing a stateful operation in a Structured Streaming query and restarting the query from an existing checkpoint, the query fails with a state schema compatibility error.
Example
The exact error message depends on your state store provider. The following example uses dropDuplicates()
on a streaming DataFrame with a watermark, backed by the RocksDB state store provider.
# Original code:
streaming_df = streaming_df.dropDuplicates(subset=["column1"])

# Later changed to:
streaming_df = streaming_df.dropDuplicates(subset=["column1", "column2"])

Restarting the query from the original checkpoint then fails with an error similar to:

org.apache.spark.sql.execution.streaming.state.StateStoreKeySchemaNotCompatible: [STATE_STORE_KEY_SCHEMA_NOT_COMPATIBLE] Provided key schema does not match existing state key schema.
Please check number and type of fields.
Existing key_schema=StructType(StructField(column1,IntegerType,true)) and new key_schema=StructType(StructField(column1,IntegerType,true),StructField(column2,TimestampType,true)).
Cause
Stateful operations in streaming queries must maintain state data across micro-batches to continuously update their results.
Structured Streaming automatically checkpoints this state data to fault-tolerant storage such as HDFS, Amazon S3, or Azure Blob Storage and restores it after a restart. However, this recovery assumes that the schema of the state data remains unchanged across restarts. In the example above, adding column2 to the dropDuplicates() subset changes the state store's key schema, so the restarted query cannot reuse the state written by the earlier version.
For more information on recovery semantics after changes in a streaming query, refer to the Apache Spark Structured Streaming Programming Guide.
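The schema check that produces this error can be illustrated with a small sketch. This is a conceptual simplification in plain Python, not Spark's actual implementation: the state store provider compares the key schema recorded in the checkpoint against the key schema of the restarted query and rejects any mismatch. The function name and schema representation below are hypothetical.

```python
# Conceptual sketch (NOT Spark's real code): a state store validates that the
# restarted query's key schema matches the schema persisted in the checkpoint.
def check_key_schema(existing: list, new: list) -> None:
    """Raise if the new key schema differs from the checkpointed one."""
    if existing != new:
        raise ValueError(
            "Provided key schema does not match existing state key schema. "
            f"Existing={existing}, new={new}"
        )

# Schema recorded in the checkpoint by the original query (keyed on column1).
existing_key_schema = [("column1", "IntegerType")]

# Schema of the changed query (keyed on column1 AND column2).
new_key_schema = [("column1", "IntegerType"), ("column2", "TimestampType")]

try:
    check_key_schema(existing_key_schema, new_key_schema)
except ValueError as e:
    print(e)
```

The mismatch is detected on restart, before any new state is written, which is why the query fails immediately rather than producing incorrect deduplication results.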
Solution
Only change stateful operations in a streaming query if you are willing to discard the existing state and start from a new checkpoint. To recover, either delete the old checkpoint directory or restart the query with a new checkpointLocation. In both cases the query rebuilds its state from scratch, so records that were deduplicated under the old state may be processed again.
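One common pattern for starting from a new checkpoint is to version the checkpoint path and bump the version whenever the stateful logic changes. The helper, base path, and query name below are hypothetical, shown only to sketch the idea:

```python
# Hypothetical helper: derive a versioned checkpoint path. Bump state_version
# manually whenever the state schema changes (for example, when the
# dropDuplicates() subset is modified), so the restarted query builds fresh
# state instead of failing against the old key schema.
def checkpoint_path(base: str, query_name: str, state_version: int) -> str:
    return f"{base}/{query_name}/v{state_version}"

# Before the change: state keyed on column1 only.
old_path = checkpoint_path("/checkpoints", "dedupe_query", 1)

# After adding column2 to the dropDuplicates() subset: a fresh location.
new_path = checkpoint_path("/checkpoints", "dedupe_query", 2)
```

The new path would then be passed to the sink via the query's checkpointLocation option, for example .option("checkpointLocation", new_path) on the writeStream builder.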