Problem
When attempting to read .gzip files from S3 with Apache Spark in the Data Engineering environment, you may find the compressed values being returned instead of the decompressed data. The symptom is rows containing raw compressed bytes, such as ��% eb���[�.K����Qh�q�h. Workarounds such as setting .option("compression", "gzip") on the reader, or explicitly configuring org.apache.hadoop.io.compress.GzipCodec, do not help.
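For example, a read like the following (the bucket and path are placeholders) returns garbled rows rather than parsed JSON records:
// Hypothetical reproduction: reading a gzip-compressed JSON file whose name ends in .gzip.
// Spark does not recognize the extension, so each row contains raw compressed bytes.
val df = spark.read.json("s3://my-bucket/raw/events.json.gzip")
df.show(5, truncate = false)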
Cause
Spark relies on file extensions to determine the compression type, via the codec's getDefaultExtension() method. Spark expects the file extension to be .gz, but the files in the S3 location have a .gzip extension instead.
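You can see the extension-based lookup directly with Hadoop's CompressionCodecFactory; this is a minimal sketch, and the file names are only examples:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

val factory = new CompressionCodecFactory(new Configuration())

// Matches GzipCodec, whose getDefaultExtension() returns ".gz"
val gz = factory.getCodec(new Path("events.json.gz"))

// Returns null: no registered codec advertises a ".gzip" extension,
// so Spark falls back to reading the raw compressed bytes
val gzip = factory.getCodec(new Path("events.json.gzip"))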
Solution
Rename the files in S3 from .gzip to .gz. This allows Spark to correctly identify the compression type and decompress the files.
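If the files live under a single prefix, a Databricks notebook can rename them in place. This is a sketch and the bucket path is a placeholder; note that dbutils.fs.mv copies and deletes, so it can be slow for large files.
// Rename every .gzip object under the prefix to .gz (path is hypothetical)
dbutils.fs.ls("s3://my-bucket/raw/")
  .filter(_.name.endsWith(".gzip"))
  .foreach(f => dbutils.fs.mv(f.path, f.path.stripSuffix(".gzip") + ".gz"))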
If it is not possible to rename the files individually, consider creating a custom library (JAR) that overrides GzipCodec to handle the .gzip extension.
Create a new codec class, package it in a small JAR, and attach the JAR to the cluster.
package com.databricks.gzipcodectransformer // package created for this specific transformation

import org.apache.hadoop.io.compress.GzipCodec

// Identical to GzipCodec, except that it advertises ".gzip" as its file extension
// so the codec lookup matches files ending in .gzip.
class MyGzipCodec extends GzipCodec {
  // Note the leading dot: Hadoop extensions are returned as ".gz", ".bz2", etc.
  override def getDefaultExtension(): String = ".gzip"
}
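As a rough guide, the JAR can be built with sbt; the versions below are assumptions and should match your cluster's Scala and Hadoop versions.
// build.sbt (versions are placeholders; align them with the cluster runtime)
name := "gzipcodectransformer"
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.4" % "provided"
Run sbt package and attach the resulting JAR to the cluster before applying either configuration below.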
There are then two ways to configure Spark to use your custom codec when reading compressed files.
Locally, per DataFrame read (preferred, because the option only applies to that read):
spark.read.option("io.compression.codecs", "com.databricks.gzipcodectransformer.MyGzipCodec").json("/path")
Globally, in a notebook:
spark.conf.set("io.compression.codecs", "com.databricks.gzipcodectransformer.MyGzipCodec")
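With the JAR attached and either setting in place, re-reading the same path (placeholder below) should now return decoded records:
// After MyGzipCodec is registered, the .gzip file is decompressed on read
val df = spark.read
  .option("io.compression.codecs", "com.databricks.gzipcodectransformer.MyGzipCodec")
  .json("s3://my-bucket/raw/events.json.gzip")
df.show(5, truncate = false)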