Problem
In an interactive cluster, you notice that when you unpersist a DataFrame cached with df.cache() by calling df.unpersist(), the old data continues to be referenced.
Cause
The unpersist() method is a non-blocking call by default. It marks the DataFrame as non-persistent and removes all blocks for it from memory and disk:

DataFrame.unpersist(blocking: bool = False) → pyspark.sql.dataframe.DataFrame
However, both unpersist() and unpersist(blocking=False) return immediately, allowing other operations to proceed while the cached blocks may remain until memory pressure forces their deletion.
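As a minimal sketch of the default behavior (the SparkSession setup, app name, and DataFrame here are illustrative, not from the original article):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unpersist-demo").getOrCreate()

df = spark.range(10_000_000)  # illustrative DataFrame
df.cache()
df.count()  # action that materializes the cache

# Default call: returns immediately; cached blocks are removed
# asynchronously and may still be referenced by operations that follow.
df.unpersist()  # same as df.unpersist(blocking=False)
```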
Solution
In your code, call unpersist(blocking=True) so that the operation completes before further actions run. This guarantees the cached data is fully cleared from memory and disk before any subsequent actions are performed.
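Continuing the sketch above, the blocking variant waits for block removal to finish before returning:

```python
# Blocking call: does not return until all cached blocks for df
# have been removed from memory and disk.
df.unpersist(blocking=True)

# Subsequent actions now run without referencing stale cached data.
df.count()
```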
For more information, refer to the pyspark.sql.DataFrame.unpersist documentation.