Identify duplicate data on append operations
A common issue when performing append operations on Delta tables is duplicate data. For example, assume user 1 performs a write operation on Delta table A. At the same time, user 2 performs an append operation on Delta table A. This can lead to duplicate records in the table. In this article, we review basic troubleshooting steps that you can use to...
1 min reading timeEnsure consistency in statistics functions between Spark 3.0 and Spark 3.1 and above
Problem The statistics functions covar_samp, kurtosis, skewness, std, stddev, stddev_samp, variance, and var_samp, return NaN when a divide by zero occurs during expression evaluation in Databricks Runtime 7.3 LTS. The same functions return null in Databricks Runtime 9.1 LTS and above, as well as Databricks SQL endpoints when a divide by zero occur...
0 min reading timeOptimize streaming transactions with .trigger
When running a structured streaming application that uses cloud storage buckets (S3, ADLS Gen2, etc.) it is easy to incur excessive transactions as you access the storage bucket. Failing to specify a .trigger option in your streaming code is one common reason for a high number of storage transactions. When a .trigger option is not specified, the sto...
1 min reading timeApache Spark UI is not in sync with job
Problem The status of your Spark jobs is not correctly shown in the Spark UI (AWS | Azure | GCP). Some of the jobs that are confirmed to be in the Completed state are shown as Active/Running in the Spark UI. In some cases the Spark UI may appear blank. When you review the driver logs, you see an AsyncEventQueue warning. Logs ===== 20/12/23 21:20:26 ...
1 min reading timeParsing post meridiem time (PM) with to_timestamp() returns null
Problem You are trying to parse a 12-hour (AM/PM) time value with to_timestamp(), but instead of returning a 24-hour time value it returns null. For example, this sample code: %sql SELECT to_timestamp('2016-12-31 10:12:00 PM', 'yyyy-MM-dd HH:mm:ss a'); Returns null when run: Cause to_timestamp() requires the hour format to be in lowercase. If the ho...
0 min reading timeHyperopt fails with maxNumConcurrentTasks error
Problem You are tuning machine learning parameters using Hyperopt when your job fails with a py4j.Py4JException: Method maxNumConcurrentTasks([]) does not exist error. You are using a Databricks Runtime for Machine Learning (Databricks Runtime ML) cluster. Cause Databricks Runtime ML has a compatible version of Hyperopt pre-installed (AWS | Azure | ...
0 min reading time