Updated February 29th, 2024 by harikrishnan.kunhumveettil

Auto Loader streaming job failure with schema inference error

Problem You have an Apache Spark streaming job using Auto Loader that encounters an error stating: Schema inference for the 'parquet' format from the existing files in the input path <Root Folder> has failed Cause One possible cause for this issue is having multiple types of files in the child directories. The input directory structure includes a ro...
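One mitigation sketch for mixed file types, assuming a hypothetical input path and schema location (this is an illustration, not necessarily this article's full solution): restrict the file listing with a glob filter so schema inference only sees the intended format.

```
# Sketch only: paths are hypothetical.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/input")
      # Limit listing to parquet files so schema inference does not
      # trip over other file types present in child directories.
      .option("pathGlobFilter", "*.parquet")
      .load("/mnt/raw/input"))
```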

Updated May 24th, 2022 by harikrishnan.kunhumveettil

JDBC write fails with a PrimaryKeyViolation error

Problem You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error. Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables. Cause When Apache Spark performs a JDBC write, one par...
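Because each partition is written by a separate task, and failed tasks can be retried, rows may be inserted more than once. A hedged sketch of one mitigation (connection details and the key column "id" are hypothetical; a staging table plus a server-side merge is the more robust fix for retry-induced duplicates):

```
# Sketch only: url, credentials, table, and key column are hypothetical.
deduped = df.dropDuplicates(["id"])  # remove duplicate keys before writing

(deduped.write
 .format("jdbc")
 .option("url", "jdbc:sqlserver://host:1433;database=db")
 .option("dbtable", "dbo.target")
 .option("user", "user")
 .option("password", "password")
 .mode("append")
 .save())
```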

Updated January 19th, 2024 by harikrishnan.kunhumveettil

Autoloader job fails with a URISyntaxException error due to invalid characters in filenames

Problem You have an Autoloader job configured in Directory listing mode and are encountering a failure with a URISyntaxException error. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: [masked_uri] Cause The error message indicates an issue with the URI (Uniform Resource Identifier) used in the Autoload...
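The root issue is URI-reserved characters (spaces, colons, and similar) in file names. As a plain-Python illustration with a hypothetical file name, percent-encoding shows which characters must be escaped before the name can appear in a valid URI:

```python
from urllib.parse import quote

# Hypothetical file name containing a space and a colon,
# both of which are invalid in a raw URI path segment.
raw_name = "daily report:v1.json"
encoded = quote(raw_name)
print(encoded)  # daily%20report%3Av1.json
```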

Updated February 29th, 2024 by harikrishnan.kunhumveettil

Auto Loader streaming query failure with UnknownFieldException error

Problem Your Auto Loader streaming job fails with an UnknownFieldException error when a new column is added to the source file of the stream. Exception: org.apache.spark.sql.catalyst.util.UnknownFieldException: Encountered unknown field(s) during parsing: <column name> Cause An UnknownFieldException error occurs when Auto Loader detects the ad...
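A hedged configuration sketch (format, paths, and schema location are hypothetical): with Auto Loader's default schema evolution mode, this failure is expected once per schema change, and restarting the stream picks up the evolved schema.

```
# Sketch only: format and paths are hypothetical.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
      # With "addNewColumns" (the default), the stream fails once when a new
      # column appears; on restart it resumes with the evolved schema.
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
      .load("/mnt/raw/events"))
```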

Updated May 16th, 2022 by harikrishnan.kunhumveettil

display() does not show microseconds correctly

Problem You want to display a timestamp value with microsecond precision, but when you use display() it does not show the value past milliseconds. For example, this Apache Spark SQL display() command: %sql display(spark.sql("select cast('2021-08-10T09:08:56.740436' as timestamp) as test")) Returns a truncated value: 2021-08-10T09:08:56.740+0000 Caus...
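The truncation is a rendering limit, not data loss. As a plain-Python illustration of the same distinction, explicit formatting preserves all six microsecond digits of the article's example value:

```python
from datetime import datetime

# The timestamp from the article's example.
ts = datetime.fromisoformat("2021-08-10T09:08:56.740436")

# Explicit formatting keeps full microsecond precision; only the
# default rendering truncates to milliseconds.
formatted = ts.strftime("%Y-%m-%dT%H:%M:%S.%f")
print(formatted)  # 2021-08-10T09:08:56.740436
```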

Updated December 8th, 2022 by harikrishnan.kunhumveettil

Custom garbage collection prevents cluster launch

Problem You are trying to use a custom Apache Spark garbage collection algorithm (other than the default, parallel garbage collection) on clusters running Databricks Runtime 10.0 and above. When you try to start a cluster, it fails to start. If the configuration is set on an executor, the executor is immediately terminated. For example, if you s...
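For illustration, a cluster-level Spark configuration along these lines (the specific collector shown is an assumed example) is the kind of setting that triggers this failure on Databricks Runtime 10.0 and above:

```
spark.executor.extraJavaOptions -XX:+UseG1GC
spark.driver.extraJavaOptions -XX:+UseG1GC
```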

Updated January 18th, 2024 by harikrishnan.kunhumveettil

Stream to stream join failure

Problem You are encountering an error when attempting to display a streaming DataFrame that is derived by performing a stream-stream join. Cause When calling the display method on a structured streaming DataFrame, the default settings use complete output mode and a memory sink. However, for stream-stream joins, the c...
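A hedged workaround sketch, assuming `joined` is the stream-stream join result (with appropriate watermarks) and the sink name is hypothetical: write the stream explicitly with append output mode instead of relying on display()'s defaults.

```
# Sketch only: assumes `joined` is a watermarked stream-stream join result.
(joined.writeStream
 .format("memory")
 .queryName("joined_preview")  # hypothetical in-memory sink name
 .outputMode("append")         # stream-stream joins support append mode,
 .start())                     # not the complete mode display() defaults to
```

The in-memory table can then be inspected with spark.table("joined_preview").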

Updated May 10th, 2022 by harikrishnan.kunhumveettil

Job fails, but Apache Spark tasks finish

Problem Your Databricks job reports a failed status, but all Spark jobs and tasks have successfully completed. Cause You have explicitly called spark.stop() or System.exit(0) in your code. If either of these is called, the Spark context is stopped, but the graceful shutdown and handshake with the Databricks job service does not happen. Solution Do ...

Updated January 19th, 2024 by harikrishnan.kunhumveettil

Offset reprocessing issues in streaming queries with a Kafka source

Problem You are using Apache Spark Structured Streaming to source data from a Kafka topic and write it to a Delta table sink, but challenges arise when attempting to reprocess data from the earliest offset in the topic. The stream is appropriately updated with the option "startingOffsets": "earliest" and restarted. However, the streaming query fails...
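One common reason "startingOffsets": "earliest" appears to be ignored is that the option is honored only when a query starts from a fresh checkpoint; an existing checkpoint's stored offsets take precedence. A hedged sketch (broker, topic, and paths are all hypothetical):

```
# Sketch only: broker, topic, and paths are hypothetical.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      # Honored only when the query starts from a new checkpoint.
      .option("startingOffsets", "earliest")
      .load())

(df.writeStream
 .format("delta")
 # Reprocessing from the earliest offset requires a fresh checkpoint
 # location; the old checkpoint's offsets would otherwise win.
 .option("checkpointLocation", "/mnt/checkpoints/events_v2")
 .start("/mnt/delta/events"))
```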
