How to retrieve Parquet file metadata

Written by chandan.kumar

Last published at: July 24th, 2025

Introduction

You want to collect metadata from your Parquet files, such as the total number of rows, the number of row groups, and per-row-group details like row count and size. This information helps you debug, validate, and build robust, scalable data pipelines.

 

Instructions

Use the following code to extract this information from files stored on Databricks File System (DBFS) or on cloud storage accessed through a mount path.

 

Pass the path of the file for which you want the total number of row groups, the total number of rows, and the row count and size of each row group.

import pyarrow.parquet as pq

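# Replace <path-to-parquet-file> with the actual Parquet file path.
# pyarrow reads through local file APIs, so a DBFS location is typically given in its /dbfs/... form.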
path = "<path-to-parquet-file>"
parquet_file = pq.ParquetFile(path)
metadata = parquet_file.metadata

print(f"Total Row Groups: {parquet_file.num_row_groups}")
print(f"Total Rows: {metadata.num_rows}")

for i in range(parquet_file.num_row_groups):
    row_group = metadata.row_group(i)
    size_bytes = row_group.total_byte_size  # Uncompressed size of the row group's column data, in bytes

    # Convert bytes to KB, MB, GB
    size_kb = size_bytes / 1024
    size_mb = size_kb / 1024
    size_gb = size_mb / 1024

    print(f"Row Group {i}:")
    print(
        f" - Size: {size_bytes} bytes ({size_kb:.2f} KB, {size_mb:.2f} MB, {size_gb:.4f} GB)"
    )
    print(f" - Rows: {row_group.num_rows}")
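
The code above prints every row group size in all four units. If you prefer a single, readable value per row group, a small helper can pick the unit for you. The following is a minimal sketch, not part of the original example: the function name format_size is hypothetical, and it assumes the same 1024-based conversion used above.

```python
def format_size(size_bytes: int) -> str:
    """Return a byte count in the largest 1024-based unit (up to GB) that keeps the value >= 1."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")

    size = float(size_bytes)
    for unit in ("bytes", "KB", "MB", "GB"):
        # Stop at the first unit where the value drops below 1024,
        # or at GB, which is the largest unit this helper reports.
        if size < 1024 or unit == "GB":
            if unit == "bytes":
                return f"{int(size)} bytes"
            return f"{size:.2f} {unit}"
        size /= 1024
```

Inside the row group loop, you could then replace the size line with print(f" - Size: {format_size(row_group.total_byte_size)}").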

 

For information on the Parquet file format, refer to the Read Parquet files using Databricks (AWS | Azure | GCP) documentation.

 

For information on what Parquet files are, refer to Parquet in the Databricks glossary.