Error PySparkNotImplementedError when using an RDD to extract distinct values on a standard cluster

Use .collect() and a list comprehension to extract distinct column values.

Written by anshuman.sahu

Last published at: April 14th, 2025

Problem

When you try to extract distinct values from a PySpark DataFrame through its resilient distributed dataset (RDD) on a standard access mode cluster, you receive a PySparkNotImplementedError.

 

Example code

df.select("column_1").distinct().rdd.flatMap(lambda x: x).collect()

 

Example error message

PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented.

 

Cause

Apache Spark RDD APIs are not supported in standard (formerly shared) access mode clusters. 

For more information, refer to the Spark API limitations and requirements for Unity Catalog standard access mode section of the Compute access mode limitations for Unity Catalog (AWS | Azure | GCP) documentation.

 

Solution

In your standard access mode cluster, call the .collect() method directly on the distinct DataFrame. This returns all the distinct rows to the driver as a list of Row objects.

Then use a list comprehension to extract the column_1 value from each Row object.

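# Collect the distinct rows as a list of Row objects (no RDD involved).
rows = df.select("column_1").distinct().collect()
# Pull the column_1 value out of each Row.
distinct_column_1 = [row.column_1 for row in rows]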
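
If you need the same pattern for more than one column, it can be wrapped in a small helper. The following is a minimal sketch, not an official API: the function name distinct_values and its parameters are hypothetical, and it assumes df is a PySpark DataFrame and column_name is a simple top-level column name.

```python
def distinct_values(df, column_name):
    """Return the distinct values of one column as a plain Python list.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame and
    `column_name` a simple top-level column name.
    """
    # collect() returns the distinct rows to the driver as a list of Row
    # objects, without going through the RDD API.
    rows = df.select(column_name).distinct().collect()
    # Index each Row by column name so the helper works for any column.
    return [row[column_name] for row in rows]
```

With this helper, the example above becomes distinct_column_1 = distinct_values(df, "column_1"). Because .collect() brings every distinct value to the driver, this approach is best suited to columns with a modest number of distinct values.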