Parquet table counts not reflecting concurrent updates

Manually refresh the table in the notebook where the count was initially taken.

Written by ram.sankarasubramanian

Last published at: September 12th, 2024

Problem

You may notice that a Parquet table count within a notebook remains the same even after additional rows are added to the table from an external process.

For instance, suppose a count is taken of a table (Table 1) in a notebook (Notebook A) and returns 100 rows. An outside process or another notebook then updates Table 1 and adds 100 additional rows. When the count is taken again in Notebook A, it still shows 100 rows.
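As a minimal sketch of this scenario (the table name table_1 and the appending DataFrame new_rows are hypothetical placeholders, and spark is the notebook's active Spark session), the sequence looks like this:

# Notebook A: initial count of the Parquet table
spark.sql("SELECT COUNT(*) FROM table_1").show()   # returns 100

# External process or Notebook B: appends 100 more rows
# (new_rows is a placeholder DataFrame with a matching schema)
new_rows.write.mode("append").format("parquet").saveAsTable("table_1")

# Notebook A: same query again; cached metadata still reports the old count
spark.sql("SELECT COUNT(*) FROM table_1").show()   # still returns 100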

Cause

Each notebook has a different Apache Spark session, different instances of the Hive client, and different caches, even if two notebooks are attached to the same cluster. When Hive metastore Parquet table conversion is enabled, the metadata of those converted tables is also cached.
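You can check whether this conversion is enabled in your session. For example (assuming spark is the notebook's active Spark session):

spark.conf.get("spark.sql.hive.convertMetastoreParquet")   # "true" means converted Parquet table metadata is cached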

Solution

Manually refresh the table in the notebook where the count was initially taken. You can do this by running a REFRESH TABLE command. For more information on metadata refreshing with Spark, please review the Spark SQL, DataFrames and Datasets Guide.
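For example, either of the following, run in Notebook A (the table name table_1 is a placeholder), invalidates the cached metadata so the next count reflects the newly added rows:

spark.sql("REFRESH TABLE table_1")                 # SQL form of the refresh
spark.catalog.refreshTable("table_1")              # equivalent PySpark Catalog API

spark.sql("SELECT COUNT(*) FROM table_1").show()   # now returns the updated count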

Alternatively, you can use Delta tables instead of Parquet tables to avoid the need for a manual refresh. For more information on using Delta tables, please review the What is Delta Lake? (AWS | Azure | GCP) documentation.
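As a sketch of that alternative (the table name table_1_delta and the DataFrames df and new_rows are hypothetical placeholders), writing the table in Delta format means subsequent reads from other notebooks pick up appended rows without a manual refresh:

# Create the table as Delta instead of Parquet (one-time change)
df.write.format("delta").saveAsTable("table_1_delta")

# Appends from any notebook or external job are visible to later reads
new_rows.write.mode("append").format("delta").saveAsTable("table_1_delta")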