Problem
You may need to find the number of files per partition in a Delta table to monitor partition sizes, optimize partitions, combine small files, and so on.
Cause
Knowing the number of files per partition in a Delta table is useful for cost management.
- Storage costs: An excessive number of small files increases storage costs because of the additional metadata and inefficient storage utilization.
- Compute costs: Too many files increase compute costs because Apache Spark must manage and read more files, leading to longer job execution times and higher cluster costs.
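Once you have identified partitions with many small files, you can typically compact them with Delta Lake's OPTIMIZE command. A minimal sketch, assuming the same <path-to-delta-table> placeholder used in the examples below:
%python
# Compact small files in the Delta table into larger ones.
# Add a WHERE clause on a partition column to target specific partitions.
spark.sql("OPTIMIZE delta.`dbfs:/<path-to-delta-table>`")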
Solution
You can find the number of files per partition with the following example code. The query uses the hidden _metadata.file_path column, which identifies the data file each row was read from. Replace <path-to-delta-table> with the full path to your Delta table before running the example code in a notebook.
Python example code
This snippet retrieves the partition columns of a Delta table and displays the number of data files per value of the first partition column, helping you assess file distribution and potential optimization needs.
%python
# Fetch the schema of the partition columns
schema = spark.sql("SHOW PARTITIONS delta.`dbfs:/<path-to-delta-table>`").schema
# Extract the partition column names
column_names = [field.name for field in schema.fields]
# Count distinct data file paths grouped by the first partition column
display(spark.sql(f"SELECT {column_names[0]}, COUNT(DISTINCT _metadata.file_path) FROM delta.`dbfs:/<path-to-delta-table>` GROUP BY {column_names[0]}"))
Example results
+---+-----------------------------------+
| id|count(DISTINCT _metadata.file_path)|
+---+-----------------------------------+
| 5| 1|
| 1| 1|
| 3| 1|
| 2| 1|
| 4| 1|
+---+-----------------------------------+
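If your table is partitioned by more than one column, you can group by every partition column instead of only the first. A sketch of this variant, using the same placeholder path:
%python
# Join all partition column names into a single GROUP BY list
schema = spark.sql("SHOW PARTITIONS delta.`dbfs:/<path-to-delta-table>`").schema
all_columns = ", ".join(field.name for field in schema.fields)
# Count distinct data file paths per combination of partition values
display(spark.sql(f"SELECT {all_columns}, COUNT(DISTINCT _metadata.file_path) FROM delta.`dbfs:/<path-to-delta-table>` GROUP BY {all_columns}"))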
Scala example code
This snippet retrieves the partition columns of a Delta table and displays the number of data files per value of the first partition column, helping you assess file distribution and potential optimization needs.
%scala
// Fetch the schema of the partitions
val schema = spark.sql("SHOW PARTITIONS delta.`dbfs:/<path-to-delta-table>`").schema
// Extract the column names
val columnNames = schema.fields.map(_.name)
// Run the SQL query to count distinct file paths grouped by the first partition column
val query = s"SELECT ${columnNames(0)}, COUNT(DISTINCT _metadata.file_path) FROM delta.`dbfs:/<path-to-delta-table>` GROUP BY ${columnNames(0)}"
val result = spark.sql(query)
// Display the result
result.show()
Example results
+---+-----------------------------------+
| id|count(DISTINCT _metadata.file_path)|
+---+-----------------------------------+
| 5| 1|
| 1| 1|
| 3| 1|
| 2| 1|
| 4| 1|
+---+-----------------------------------+
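If you only need the total number of files in the table rather than a per-partition breakdown, Delta Lake's DESCRIBE DETAIL command returns a numFiles column. A minimal sketch:
%python
# DESCRIBE DETAIL returns one row of table-level metadata for a Delta table,
# including numFiles, the total number of data files in the current version.
display(spark.sql("DESCRIBE DETAIL delta.`dbfs:/<path-to-delta-table>`").select("numFiles"))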