Problem
You build a JAR file defining a custom Kryo serializer and install it as a cluster library using the API or the UI. You then add the spark.serializer org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator <your-custom-kryo-class> Apache Spark properties to your cluster's configuration.
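For illustration, assuming a registrator class named com.example.MyKryoRegistrator (a hypothetical name; use your own class), the two entries in the Spark config box would look like:

```
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator com.example.MyKryoRegistrator
```
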
When you then try to execute a job or notebook, it fails with a ClassNotFoundException error.
Cause
When you install a custom library and define a custom serializer on a Databricks cluster, the library is initially only installed on the driver, not on the executors.
The library is made available to the executors, but each executor only installs it when the first task that needs it runs. Because Spark needs the custom serializer classes to deserialize those first tasks, the executor hits the ClassNotFoundException before the library installation can complete.
Solution
1. Instead of installing the library on the cluster through the Libraries API or UI, upload the JAR file with the custom Kryo classes to your workspace file system or a volume.
2. Create the following init script, substituting the JAR file path from the previous step.
#!/bin/sh
cp <your-jar-file-path> /databricks/jars/
3. Add the init script from the previous step to the cluster configurations under the Advanced options > Init Scripts tab.
4. Make sure the custom Kryo serializer configuration is still in place. In the same Advanced options section, click the Spark tab and verify that the spark.serializer and spark.kryo.registrator properties are still present in the Spark config box.
5. Restart the cluster.
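Putting steps 1 and 2 together, a complete init script might look like the sketch below. The volume path and JAR name are hypothetical placeholders; substitute the path where you uploaded your JAR in step 1.

```sh
#!/bin/sh
# Hypothetical path -- replace with the location from step 1.
JAR_SRC=/Volumes/main/default/libs/my-kryo-serializer.jar

# Copy the JAR into the directory that Databricks adds to the classpath
# on both the driver and the executors, so the serializer classes are
# available before the first task is deserialized.
cp "$JAR_SRC" /databricks/jars/
```

Because init scripts run on every node before Spark starts, this makes the classes available cluster-wide from the outset, avoiding the lazy executor-side library installation described in the Cause section.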