Apache Spark reading .gzip files from S3 instead of decompressed data

Rename the files in S3 from .gzip to .gz.

Written by kuldeep.mishra

Last published at: September 12th, 2024

Problem

When attempting to read .gzip files from S3 using Apache Spark in the Data Engineering environment, you may find that Spark returns the compressed values instead of the decompressed data.

Instead of readable records, the output contains raw compressed values such as ��% eb���[�.K����Qh�q�h.

Further, you may find that workarounds such as setting .option("compression", "gzip") on the reader or explicitly specifying the org.apache.hadoop.io.compress.GzipCodec codec do not help.
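
For illustration, a read along these lines reproduces the symptom (the bucket and file path are placeholders, not from the original article):

// Because the file ends in .gzip, no compression codec is matched, so Spark
// returns the raw compressed bytes as rows instead of decoded text.
val df = spark.read.text("s3://my-bucket/landing/events.json.gzip")
df.show(false) // rows contain binary garbage rather than JSON text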

Cause

Spark relies on the file extension to determine the compression type; each Hadoop compression codec reports the extension it handles through its getDefaultExtension() method. GzipCodec expects the extension .gz, but the files in the S3 location have a .gzip extension instead, so no codec is matched and the files are read as if they were uncompressed.
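
You can observe the extension matching in isolation with Hadoop's CompressionCodecFactory. This is a small sketch to run in a notebook; the bucket and file names are placeholders:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

// The factory matches codecs by file extension. With only the built-in codecs
// registered, a .gz path resolves to GzipCodec while a .gzip path matches nothing.
val factory = new CompressionCodecFactory(spark.sparkContext.hadoopConfiguration)
factory.getCodec(new Path("s3://my-bucket/events.json.gz"))   // org.apache.hadoop.io.compress.GzipCodec
factory.getCodec(new Path("s3://my-bucket/events.json.gzip")) // null: extension not recognized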

Solution

Rename the files in S3 from .gzip to .gz. This will allow Spark to correctly identify the compression type and decompress the files.
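
If you manage the files from a Databricks notebook, one way to rename them in place is a sketch along these lines (the bucket and prefix are placeholders; dbutils.fs.mv copies and then deletes, so large files take time):

// Rename every .gzip object under the prefix to .gz so Spark's codec matching works.
dbutils.fs.ls("s3://my-bucket/landing/")
  .filter(_.name.endsWith(".gzip"))
  .foreach { f =>
    dbutils.fs.mv(f.path, f.path.stripSuffix(".gzip") + ".gz")
  }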

If it is not possible to rename files individually, consider creating a custom library (JAR) to override GzipCodec to handle the .gzip extension.

Create a new codec class, package it into a small JAR, and attach the JAR to the cluster.

package com.databricks.gzipcodectransformer // package created for this specific transformation

import org.apache.hadoop.io.compress.GzipCodec

// Behaves exactly like the standard gzip codec, but reports .gzip as its default
// extension so that files ending in .gzip are matched and decompressed.
class MyGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gzip"
}
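
A minimal build.sbt sketch for packaging this class (the Scala and Hadoop versions are assumptions; match them to your cluster's Databricks Runtime). Run sbt package and attach the resulting JAR to the cluster:

name := "gzipcodectransformer"
scalaVersion := "2.12.15"
// Hadoop classes are already available on the cluster, so the dependency is compile-time only.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.4" % "provided"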

There are then two ways to configure your custom codec to read compressed files.

Locally, for each DataFrame (preferred): 

spark.read.option("io.compression.codecs", "com.databricks.gzipcodectransformer.MyGzipCodec").json("/path")

Globally, in a notebook: 

spark.conf.set("io.compression.codecs", "com.databricks.gzipcodectransformer.MyGzipCodec")
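
After the global setting is in place, subsequent reads should resolve the .gzip extension through MyGzipCodec and return decompressed records. A short usage sketch, with a placeholder bucket and path:

val df = spark.read.json("s3://my-bucket/landing/events.json.gzip") // hypothetical path
df.show() // decompressed JSON records rather than raw compressed bytes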