Append output is not supported without a watermark
Problem You are performing an aggregation using append mode and an exception is returned with the error message: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark Cause You cannot use append mode on an aggregated DataFrame without a watermark. This is by design. Solution You must apply a...
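As a minimal sketch of the fix, assuming an event stream with an eventTime timestamp column (the built-in rate source below stands in for a real one): define a watermark before the aggregation so append mode can finalize and emit windows.

%scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical event stream; the rate source stands in for a real one.
val events = spark.readStream
  .format("rate")
  .load()
  .withColumnRenamed("timestamp", "eventTime")

// Declare how late data may arrive, then aggregate.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"))
  .count()

// Append mode is now accepted: windows older than the watermark are finalized.
counts.writeStream
  .outputMode("append")
  .format("console")
  .start()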
Apache Spark DStream is not supported
Problem You are attempting to use a Spark Discretized Stream (DStream) in a Databricks streaming job, but the job is failing. Cause DStreams and the DStream API are not supported by Databricks. Solution Instead of using Spark DStream, you should migrate to Structured Streaming. Review the Databricks Structured Streaming in production (AWS | Azure | ...
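For orientation, a minimal Structured Streaming equivalent of a classic DStream socket word count (host and port are hypothetical):

%scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Read lines from a socket, as a DStream socketTextStream would.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split lines into words and count them.
val wordCounts = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word")
  .count()

wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()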
Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
When you stream data into a file sink, you should always change both checkpoint and output directories together. Otherwise, you can get failures or unexpected outputs. Apache Spark creates a folder inside the output directory named _spark_metadata. This folder contains write-ahead logs for every batch run. This is how Spark gets exactly-once guarant...
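A sketch of a correct relocation, assuming streamingDF is the streaming DataFrame and both paths are hypothetical: change the checkpoint and output directories in the same step.

%scala
streamingDF.writeStream
  .format("parquet")
  .option("checkpointLocation", "/mnt/new/checkpoint") // new checkpoint directory
  .option("path", "/mnt/new/output")                   // new output directory; _spark_metadata is created here
  .start()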
Get the path of files consumed by Auto Loader
When you process streaming files with Auto Loader (AWS | Azure | GCP), events are logged based on the files created in the underlying storage. This article shows you how to add the file path for every filename to a new column in the output DataFrame. One use case for this is auditing. When files are ingested to a partitioned folder structure there i...
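A sketch of the approach, assuming a JSON source and a predefined schema (paths hypothetical): input_file_name() records the full path of the file each row came from.

%scala
import org.apache.spark.sql.functions.input_file_name

val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .schema(schema) // `schema` is assumed to be defined elsewhere
  .load("/mnt/source/")
  .withColumn("filePath", input_file_name()) // source file path, e.g. for auditing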
How to set up Apache Kafka on Databricks
This article explains how to set up Apache Kafka on AWS EC2 machines and connect them to Databricks. The following high-level steps are required to create a Kafka cluster and connect to it from Databricks notebooks. Step 1: Create a new VPC in AWS When creating the new VPC, set its CIDR range to be different from the Databricks VPC CIDR range...
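Once the VPCs are peered and the brokers are reachable, a quick connectivity check from a notebook might look like this (broker address and topic are hypothetical):

%scala
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "ec2-12-34-56-78.compute-1.amazonaws.com:9092")
  .option("subscribe", "test-topic")
  .option("startingOffsets", "earliest")
  .load()

display(kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))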
Handling partition column values while using an SQS queue as a streaming source
Problem If data in S3 is stored by partition, the partition column values are used to name folders in the source directory structure. However, if you use an SQS queue as a streaming source, the S3-SQS source cannot detect the partition column values. For example, if you save the following DataFrame to S3 in JSON format: %scala val df = spark.range(1...
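One workaround, sketched below with a hypothetical queue URL and a hypothetical date=... partition layout: recover the partition value by parsing the source file path, since the S3-SQS source does not infer it.

%scala
import org.apache.spark.sql.functions._

val df = spark.readStream
  .format("s3-sqs")
  .option("fileFormat", "json")
  .option("queueUrl", "https://sqs.us-west-2.amazonaws.com/<account-id>/<queue-name>")
  .schema(schema) // `schema` is assumed to be defined elsewhere
  .load()
  // Parse the partition value out of the file path instead.
  .withColumn("date", regexp_extract(input_file_name(), "date=([^/]+)", 1))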
How to restart a structured streaming query from last written offset
Scenario You have a stream, running a windowed aggregation query, that reads from Apache Kafka and writes files in Append mode. You want to upgrade the application and restart the query with the offset equal to the last written offset. You want to discard all state information that hasn’t been written to the sink, start processing from the earliest ...
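A sketch of the restart, with hypothetical broker, topic, and offsets (in practice you would take the offsets of the last committed batch from the old checkpoint's offset log), combined with a fresh checkpoint directory so old state is discarded:

%scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1.example.com:9092")
  .option("subscribe", "events")
  // JSON map of topic -> partition -> offset to resume from.
  .option("startingOffsets", """{"events":{"0":1234,"1":5678}}""")
  .load()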
How to switch a SNS streaming job to a new SQS queue
Problem You have a Structured Streaming job running via the S3-SQS connector. You want to recreate the source SQS queue, backed by SNS data, and continue processing with a new queue in the same job and the same output directory. Solution Use the following procedure: Create new SQS queues and subscribe them to s3-events (from SNS). At...
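After the new queue is in place, pointing the job at it is a one-option change, sketched here with a hypothetical queue URL; the checkpoint and output directories stay the same.

%scala
val df = spark.readStream
  .format("s3-sqs")
  .option("fileFormat", "json")
  .option("queueUrl", "https://sqs.us-west-2.amazonaws.com/<account-id>/<new-queue-name>") // new queue
  .schema(schema) // `schema` is assumed to be defined elsewhere
  .load()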
Kafka error: No resolvable bootstrap urls
Problem You are trying to read or write data to a Kafka stream when you get an error message. kafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer Caused by: kafkashaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers If you are running a notebook, the error me...
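The usual fix is a bootstrap.servers value whose hosts all resolve from the cluster; a sketch with hypothetical broker names:

%scala
val df = spark.readStream
  .format("kafka")
  // Every host:port pair here must resolve from the cluster; a single
  // unresolvable name triggers "No resolvable bootstrap urls".
  .option("kafka.bootstrap.servers",
    "broker1.example.com:9092,broker2.example.com:9092")
  .option("subscribe", "my-topic")
  .load()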
readStream() is not whitelisted error when running a query
Problem You have table access control (AWS | Azure | GCP) enabled on your cluster. You are trying to run a structured streaming query and get an error message. py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.s...
Checkpoint files not being deleted when using display()
Problem You have a streaming job using display() to display DataFrames. %scala val streamingDF = spark.readStream.schema(schema).parquet(<input_path>) display(streamingDF) Checkpoint files are being created, but are not being deleted. You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder. Ch...
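You can confirm the leftovers from a notebook, as sketched below, by listing the driver-local folder named above:

%scala
// List the driver-local temp folder where display() places its checkpoints.
display(dbutils.fs.ls("file:/local_disk0/tmp/"))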
Checkpoint files not being deleted when using foreachBatch()
Problem You have a streaming job using foreachBatch() to process DataFrames. %scala streamingDF.writeStream.outputMode("append").foreachBatch { (batchDF: DataFrame, batchId: Long) => batchDF.write.format("parquet").mode("overwrite").save(output_directory) }.start() Checkpoint files are being created, but are not being deleted. You can verify th...
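One mitigation, sketched with a hypothetical path, is to give the query an explicit checkpointLocation so the checkpoint files land in a directory you control and can clean up:

%scala
import org.apache.spark.sql.DataFrame

streamingDF.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/mnt/streaming/checkpoints/my-job") // explicit, managed location
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write.format("parquet").mode("overwrite").save(output_directory)
  }
  .start()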
Conflicting directory structures error
Problem You have an Apache Spark job that is failing with a Java assertion error java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Example stack trace Caused by: org.apache.spark.sql.streaming.StreamingQueryException: There was an error when trying to infer the partition schema of the current batch of files. Plea...
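If the input mixes partitioned and non-partitioned paths, one common remedy (sketched with hypothetical paths) is to set basePath so Spark knows where the partitioned table starts:

%scala
val df = spark.readStream
  .schema(schema) // `schema` is assumed to be defined elsewhere
  // Treat everything under basePath as one partitioned table.
  .option("basePath", "/mnt/data/events/")
  .parquet("/mnt/data/events/date=2021-01-01/")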
RocksDB fails to acquire a lock
Problem You are trying to use RocksDB as a state store for your structured streaming application, when you get an error message saying that the instance could not be acquired. Caused by: java.lang.IllegalStateException: RocksDB instance could not be acquired by [ThreadId: 742, task: 140.3 in stage 3152, TID 553193] as it was not released by [ThreadI...
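Since the lock is held per checkpoint, one precaution (a sketch; the query name and paths are hypothetical) is to stop any still-active run of the query before restarting it against the same checkpoint:

%scala
// Stop any active query still holding state for this checkpoint.
spark.streams.active
  .filter(_.name == "my-stateful-query")
  .foreach(_.stop())

val restarted = df.writeStream
  .queryName("my-stateful-query")
  .option("checkpointLocation", "/mnt/chk/my-stateful-query")
  .start("/mnt/out/my-stateful-query")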
Stream XML files using an auto-loader
Apache Spark does not include a streaming API for XML files. However, you can combine the Auto Loader features of the Spark batch API with the OSS library, Spark-XML, to stream XML files. In this article, we present a Scala-based solution that parses XML data using Auto Loader. Install Spark-XML library You must install the Spark-XML OSS library ...
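A condensed sketch of the approach (paths hypothetical; the Spark-XML library must be installed on the cluster): ingest raw XML with Auto Loader's binaryFile format, then parse each payload with Spark-XML's from_xml.

%scala
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

// Infer the payload schema once from a static read of the same files.
val sample = spark.read.format("binaryFile")
  .load("/mnt/xml/")
  .select($"content".cast("string"))
  .as[String]
val payloadSchema = schema_of_xml(sample)

// Stream the files with Auto Loader and parse each payload.
val parsed = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "binaryFile")
  .load("/mnt/xml/")
  .select(from_xml($"content".cast("string"), payloadSchema).as("parsed"))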
Streaming job using Kinesis connector fails
Problem You have a streaming job writing to a Kinesis sink, and it is failing with out-of-memory error messages. java.lang.OutOfMemoryError: GC Overhead limit exceeded java.lang.OutOfMemoryError: Java heap space. Symptoms include: Ganglia shows a gradual increase in JVM memory usage. Microbatch analysis shows input and processing rates are consisten...
Streaming job gets stuck writing to checkpoint
Problem You are monitoring a streaming job, and notice that it appears to get stuck when processing data. When you review the logs, you discover the job gets stuck when writing data to a checkpoint. INFO HDFSBackedStateStoreProvider: Deleted files older than 381160 for HDFSStateStoreProvider[id = (op=0,part=89),dir = dbfs:/FileStore/R_CHECKPOINT5/st...