You have a Structured Streaming job running via the S3-SQS connector. Suppose you need to recreate the source SQS queue (fed by SNS notifications of S3 events) and want the new queue to be processed by the same job, writing to the same output directory, without duplicating or losing data.
Use the following procedure:
- Create a new SQS queue and subscribe it to the S3 event notifications (via SNS). At this point, the same messages are delivered to both the old and new queues.
- Set the option allowOverwrites to false in the new streaming job and start it.
- Run the old and new jobs in parallel for a short overlap, longer than the trigger interval, and then shut down the old job.
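The new job's source definition might look like the following sketch. The queue URL, region, schema, and checkpoint path are placeholders, and it assumes the new stream reuses the existing checkpoint location so the log of already-processed files carries over from the old job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Placeholder schema: file-based sources require an explicit schema.
val eventSchema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)

// Read from the NEW queue. With allowOverwrites set to false, the source
// skips any file path already recorded in the checkpoint, so messages
// delivered to both queues during the overlap are processed only once.
val events = spark.readStream
  .format("s3-sqs")
  .schema(eventSchema)
  .option("queueUrl", "https://sqs.us-east-1.amazonaws.com/123456789012/new-queue") // placeholder
  .option("region", "us-east-1")                    // placeholder
  .option("fileFormat", "json")
  .option("allowOverwrites", "false")               // default is true

events.writeStream
  .format("parquet")
  .option("path", "/output/events")                 // same output directory as the old job
  .option("checkpointLocation", "/checkpoints/sqs") // same checkpoint as the old job (assumption)
  .start()
```

Once this job is running alongside the old one and has completed a few triggers, the old job can be stopped safely.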
Why does this work?
With the S3-SQS source, Apache Spark records the already-processed file paths in the checkpoint directory. If you set allowOverwrites to false (the default is true), a file path fetched from both queues while they run simultaneously is processed only once; the duplicate fetch is discarded. As a result, files are not reprocessed and there is neither duplication nor loss of data.