Apache Spark session is null in DBConnect
Problem You are trying to run your code using Databricks Connect (AWS | Azure | GCP) when you get a sparkSession is null error message. java.lang.AssertionError: assertion failed: sparkSession is null while trying to executeCollectResult at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(...
Databricks Connect reports version error with Databricks Runtime 6.4
Problem You are using the Databricks Connect client with Databricks Runtime 6.4 and receive an error message which states that the client does not support the cluster. Caused by: java.lang.IllegalArgumentException: The cluster is running server version `dbr-6.4` but this client only supports Set(dbr-5.5). You can find a list of client releases at ht...
Failed to create process error with Databricks CLI in Windows
Problem While trying to access the Databricks CLI (AWS | Azure | GCP) in Windows, you get a failed to create process error message. Cause This can happen if multiple instances of the Databricks CLI are installed on the system, or if the Python path on your Windows system includes a space. Info There is a known issue in pip which causes pip installed s...
GeoSpark undefined function error with DBConnect
Problem You are trying to use the GeoSpark function st_geomfromwkt with DBConnect (AWS | Azure | GCP) and you get an Apache Spark error message. Error: org.apache.spark.sql.AnalysisException: Undefined function: 'st_geomfromwkt'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; T...
Get Apache Spark config in DBConnect
You can always view the Spark configuration (AWS | Azure | GCP) for your cluster by reviewing the cluster details in the workspace. If you are using DBConnect (AWS | Azure | GCP) you may want to quickly review the current Spark configuration details without switching over to the workspace UI. This example code shows you how to get the current Spark ...
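As a minimal sketch of that idea in Python (not necessarily the article's own example): the helper names `current_spark_conf` and `format_conf` are hypothetical and introduced here only for illustration. The sketch assumes PySpark or the Databricks Connect client is installed and configured, and that the session exposes `sparkContext`. The only Spark calls used are `SparkSession.builder.getOrCreate()`, `sparkContext.getConf()`, and `SparkConf.getAll()`.

```python
def current_spark_conf():
    """Return the active session's Spark configuration as (key, value) pairs.

    Assumption: pyspark is importable and a cluster connection is configured.
    The import is local so the formatting helper below works without Spark.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # SparkConf.getAll() returns a list of (key, value) string tuples.
    return spark.sparkContext.getConf().getAll()


def format_conf(pairs):
    """Render (key, value) pairs as 'key = value' lines, sorted by key."""
    return [f"{key} = {value}" for key, value in sorted(pairs)]
```

With a working connection, `print("\n".join(format_conf(current_spark_conf())))` would list the settings in the terminal, which avoids a trip to the workspace UI.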
How to sort S3 files by modification time in Databricks notebooks
Problem When you use the dbutils utility to list the files in an S3 location, the files are listed in random order. dbutils doesn’t provide any method to sort the files by modification time, and it doesn’t list a modification time either. Solution Use the Hadoop filesystem API to sort the S3 files, as shown here: %scala import org....
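The same approach can be sketched in Python, as an illustration rather than the article's Scala code. The function names `list_with_mtime` and `sort_by_mtime` are hypothetical. Reaching the Hadoop API through `spark._jvm` and `spark._jsc` relies on PySpark's private py4j handles, which is an assumption about the environment. The Hadoop calls used are `Path.getFileSystem`, `FileSystem.listStatus`, `FileStatus.getPath`, and `FileStatus.getModificationTime`.

```python
def list_with_mtime(spark, location):
    """Return (path, modification_time_ms) pairs for the files under location.

    Assumption: spark is a PySpark session whose private _jvm and _jsc
    handles are available (they are not part of the public API).
    """
    path = spark._jvm.org.apache.hadoop.fs.Path(location)
    fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
    return [
        (status.getPath().toString(), status.getModificationTime())
        for status in fs.listStatus(path)
    ]


def sort_by_mtime(entries, newest_first=False):
    """Sort (path, modification_time_ms) pairs by modification time."""
    return sorted(entries, key=lambda entry: entry[1], reverse=newest_first)
```

Keeping the sort separate from the listing means the ordering logic can be checked on plain tuples without a cluster.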
Invalid Access Token error when running jobs with Airflow
Problem When you run scheduled Airflow Databricks jobs, you get this error: Invalid Access Token : 403 Forbidden Error Cause To run or schedule Databricks jobs through Airflow, you need to configure the Databricks connection using the Airflow web UI. Any of the following incorrect settings can cause the error: Set the host field to the Databricks wo...
ProtoSerializer stack overflow error in DBConnect
Problem You are using DBConnect (AWS | Azure | GCP) to run a PySpark transformation on a DataFrame with more than 100 columns when you get a stack overflow error. py4j.protocol.Py4JJavaError: An error occurred while calling o945.count. : java.lang.StackOverflowError at java.lang.Class.getEnclosingMethodInfo(Class.java:1072) at java.lang.Clas...
Use tcpdump to create pcap files
If you want to analyze the network traffic between nodes on a specific cluster, you can install tcpdump on the cluster and use it to dump the network packet details to pcap files. The pcap files can then be downloaded to a local machine for analysis. Create the tcpdump init script Run this sample script in a notebook on the cluster to create the ini...