Problem
When you try to extract distinct values from a PySpark DataFrame using a resilient distributed dataset (RDD) on a cluster with standard access mode, you receive a PySparkNotImplementedError.
Example code
df.select("column_1").distinct().rdd.flatMap(lambda x: x).collect()
Example error message
PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented.
Cause
Apache Spark RDD APIs are not supported in standard (formerly shared) access mode clusters.
For more information, refer to the Spark API limitations and requirements for Unity Catalog standard access mode section of the Compute access mode limitations for Unity Catalog (AWS | Azure | GCP) documentation.
Solution
In your standard access mode cluster, use the .collect() method directly on the distinct DataFrame. This retrieves all the distinct rows as a list of Row objects. Then apply a list comprehension to extract the column_1 values from the Row objects.
rows = df.select("column_1").distinct().collect()
distinct_column_1 = [row.column_1 for row in rows]
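As a minimal sketch of the list-comprehension step that runs without a Spark cluster, the snippet below uses collections.namedtuple as a stand-in for pyspark.sql.Row (both expose column values as attributes), with sample data assumed for illustration:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row; on a real cluster, `rows` would come from
# df.select("column_1").distinct().collect()
Row = namedtuple("Row", ["column_1"])
rows = [Row(column_1="a"), Row(column_1="b"), Row(column_1="c")]

# Extract the column_1 value from each Row object
distinct_column_1 = [row.column_1 for row in rows]

print(distinct_column_1)  # ['a', 'b', 'c']
```

The same comprehension works unchanged on the Row objects returned by .collect(), since they also support attribute access by column name.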