Problem
When you try to create a distributed Ray dataset from an Apache Spark DataFrame using the ray.data.from_spark() function, you encounter the following error.
RuntimeError: In databricks runtime, if you want to use 'ray.data.from_spark' API, you need to set spark cluster config 'spark.databricks.pyspark.dataFrameChunk.enabled' to 'true'.
File <command-602145481410085>, line 3
1 import ray.data
----> 3 ray_dataset = ray.data.from_spark(dataframe)
Cause
The spark.databricks.pyspark.dataFrameChunk.enabled configuration is set to false by default.
Solution
Set spark.databricks.pyspark.dataFrameChunk.enabled to true in the cluster's Spark config so that the from_spark() function works as expected.
- Navigate to your cluster’s configuration page.
- Click the Advanced Options accordion.
- Click the Spark tab.
- In the Spark Config textbox, enter spark.databricks.pyspark.dataFrameChunk.enabled true
- Click Confirm.
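After the cluster picks up the new setting, you may want to confirm it from a notebook before calling from_spark(). The following is a minimal sketch, not an official check: the helper name chunk_api_enabled is hypothetical, and it assumes the cluster-level value can be read through spark.conf.get(key, default) in your runtime.

```python
# Configuration key taken from the error message above.
CHUNK_KEY = "spark.databricks.pyspark.dataFrameChunk.enabled"


def chunk_api_enabled(conf) -> bool:
    """Return True if the DataFrame chunk setting reads as 'true'.

    `conf` is any object with a `get(key, default)` method, for example
    `spark.conf` in a notebook (hypothetical helper, for illustration only).
    """
    value = conf.get(CHUNK_KEY, "false")
    # Spark config values are strings, so compare case-insensitively.
    return str(value).strip().lower() == "true"


# Possible usage in a notebook, where `spark` and `dataframe` already exist:
#
#   import ray.data
#   if chunk_api_enabled(spark.conf):
#       ray_dataset = ray.data.from_spark(dataframe)
#   else:
#       print(f"Set {CHUNK_KEY} to true in the cluster's Spark config first.")
```

If the helper returns False after you have edited the Spark config, check that the change was saved on the same cluster the notebook is attached to.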