Introduction
You want to collect metadata from your Parquet files, such as total rows, number of row groups, and per-row-group details like row count and size, so you can debug, validate, and build robust, scalable data pipelines.
Instructions
Use the following code to extract the needed information from files stored on Databricks File System (DBFS) or cloud storage with a mount path.
Pass the path of the file for which you want to extract the total number of row groups, the total rows, and the details for each row group.
import pyarrow.parquet as pq
# Replace <path-to-parquet-file> with the actual parquet file path.
path = "<path-to-parquet-file>"
parquet_file = pq.ParquetFile(path)
metadata = parquet_file.metadata
print(f"Total Row Groups: {parquet_file.num_row_groups}")
print(f"Total Rows: {metadata.num_rows}")
for i in range(parquet_file.num_row_groups):
    row_group = metadata.row_group(i)
    size_bytes = row_group.total_byte_size  # Total uncompressed size of the row group's column data, in bytes
    # Convert bytes to KB, MB, GB
    size_kb = size_bytes / 1024
    size_mb = size_kb / 1024
    size_gb = size_mb / 1024
    print(f"Row Group {i}:")
    print(
        f" - Size: {size_bytes} bytes ({size_kb:.2f} KB, {size_mb:.2f} MB, {size_gb:.4f} GB)"
    )
    print(f" - Rows: {row_group.num_rows}")
For information on reading Parquet files, refer to the Read Parquet files using Databricks (AWS | Azure | GCP) documentation.
For information on what Parquet files are, refer to Parquet in the Databricks glossary.