DataFrame in an interactive cluster still showing cached data after calling the unpersist() function

Use unpersist(blocking=True) to ensure the unpersist() operation completes before you proceed with further actions.

Written by MuthuLakshmi.AN

Last published at: March 21st, 2025

Problem

In an interactive cluster, you notice that when you unpersist DataFrames cached with df.cache() by calling df.unpersist(), the old data continues to be referenced.


Cause

The unpersist() function is a non-blocking call by default. It marks the DataFrame as non-persistent and removes all of its blocks from memory and disk:

DataFrame.unpersist(blocking: bool = False) → pyspark.sql.dataframe.DataFrame


However, both unpersist() and unpersist(blocking=False) return immediately, so subsequent operations can still read the cached blocks until memory pressure forces block deletion.
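The following sketch illustrates the pattern. The table name my_table and the concurrent update are illustrative assumptions, not from the original scenario.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("my_table")  # hypothetical table
df.cache()
df.count()  # an action materializes the cache

# ... my_table is updated by another process ...

df.unpersist()  # non-blocking by default: returns before the cached blocks are removed
df.count()      # may still be served from the stale cached blocks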


Solution

In your code, call unpersist(blocking=True) so the unpersist() operation completes before further actions proceed. This guarantees the cached data is fully cleared from memory before any subsequent actions run.
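A minimal sketch of the fix, continuing the hypothetical example above:

df.unpersist(blocking=True)  # waits until all cached blocks are removed before returning
df.count()                   # recomputes from the source instead of the stale cache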


For more information, refer to the pyspark.sql.DataFrame.unpersist documentation.