PyArrow hotfix breaking change

PyArrow versions 0.14 - 14.0.0 contain a security vulnerability.

Written by Adam Pavlacka

Last published at: December 6th, 2023

Problem

On Dec 5th, 2023, Databricks rolled out a security update to all supported Databricks Runtime versions to address a critical vulnerability (CVE-2023-47248) in the PyArrow Python package embedded in Databricks Runtime.

With this update, Databricks packages and automatically activates pyarrow-hotfix as remediation. The PyArrow community recommends this solution. The change turns off the vulnerable PyArrow feature, which Databricks Runtime does not use by default. However, it could break workloads that were customized to use the vulnerable feature.

This security update applies only to the PyArrow package embedded in Databricks Runtime. It does not apply if you installed your own version of PyArrow on a Databricks cluster.
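Because a self-installed PyArrow is not covered by the update, it can help to check which version is actually installed. The following is a minimal sketch that compares a version string against the affected range stated in this article (0.14 through 14.0.0). The helper names `parse_release`, `is_affected`, and `installed_pyarrow_version` are hypothetical, and the check looks only at the version number; it does not tell you whether pyarrow-hotfix is active.

```python
import re
from importlib import metadata

# Affected range as stated in this article: PyArrow 0.14 through 14.0.0.
FIRST_AFFECTED = (0, 14, 0)
LAST_AFFECTED = (14, 0, 0)


def parse_release(version: str) -> tuple:
    """Return the leading numeric release as a 3-tuple, e.g. '14.0.1' -> (14, 0, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '0.14' with zeros.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_affected(version: str) -> bool:
    """True if the version falls inside the affected range (inclusive)."""
    return FIRST_AFFECTED <= parse_release(version) <= LAST_AFFECTED


def installed_pyarrow_version():
    """Return the installed PyArrow version string, or None if it is not installed."""
    try:
        return metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pyarrow_version()
    if version is None:
        print("PyArrow is not installed in this environment.")
    else:
        print(f"PyArrow {version}: affected = {is_affected(version)}")
```

Run this in a notebook cell or on the driver of the cluster in question, since the result reflects whichever PyArrow the current Python environment resolves.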

Impact

  • There is no impact if your workloads don’t use the PyArrow extension datatype pyarrow.PyExtensionType.
  • There is no impact if your workloads use the secure PyArrow extension datatype, pyarrow.ExtensionType.
  • If your workloads use the vulnerable PyArrow extension datatype pyarrow.PyExtensionType, then they will fail with the following error message:
Found disallowed extension datatype (arrow.py_extension_type), please check   
https://kb.databricks.com/pyarrow-hotfix-breaking-change for helps.  

Original error message from pyarrow-hotfix:  
Disallowed deserialization of 'arrow.py_extension_type':  
......

Solution

You can fix the workload by completing one of the following actions.

Option 1: Code change

Databricks recommends performing a code change to remediate the vulnerability.

If you use pyarrow.PyExtensionType to process Parquet files, change your Parquet files and data processing code to use the secure API, pyarrow.ExtensionType, instead of pyarrow.PyExtensionType. This long-term solution requires changes to the code or process that produces the Parquet files.

Option 2: Turn off the security update

If you cannot perform the code changes and you trust the provider of the Parquet files, you can temporarily turn off the security update at your own risk. This temporary remediation is available in Databricks Runtime 14.2 and below.

To turn off the automatic activation of the security update, set the following environment variable (AWS | Azure | GCP) in the configuration of every affected cluster.

DATABRICKS_DISABLE_AUTO_PYARROW_HOTFIX=True

Turning off the automatic activation re-enables PyArrow’s ability to work with Parquet files that use the vulnerable API, pyarrow.PyExtensionType.
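Since this setting leaves a cluster exposed, it can be useful to confirm from a notebook whether the override is present. The sketch below only reads the environment variable named above. The helper name `hotfix_override_requested` is hypothetical, and the case-insensitive comparison with "true" is an assumption; this article does not describe how Databricks Runtime itself parses the value.

```python
import os

# Environment variable name taken from this article.
HOTFIX_OVERRIDE_VAR = "DATABRICKS_DISABLE_AUTO_PYARROW_HOTFIX"


def hotfix_override_requested(env=None) -> bool:
    """Report whether the override variable is set to the documented value.

    Assumption: the value is compared case-insensitively with "true".
    """
    env = os.environ if env is None else env
    return env.get(HOTFIX_OVERRIDE_VAR, "").strip().lower() == "true"


if __name__ == "__main__":
    if hotfix_override_requested():
        print("Override is set: the automatic pyarrow-hotfix activation was turned off.")
    else:
        print("Override is not set.")
```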

Help me choose

Do you use extension datatype in PyArrow in your workload?
  • Yes: Continue to the next question.
  • No: This issue does not impact you.

Do you use pyarrow.PyExtensionType (insecure)?
  • Yes: Continue to the next question.
  • No: This issue does not impact you.

Are you comfortable migrating to pyarrow.ExtensionType (secure)?
  • Yes: You should follow the steps in Option 1: Code change.
  • No, I want to DISABLE the security update: Continue to the next question.

Do you trust the parquet file provider?
  • Yes: You should follow the steps in Option 2: Turn off the security update.
  • No: Your system is in an UNSAFE state. This is not recommended.
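The decision flow above can also be written as a small function, which may be handy when triaging many workloads. This is only a sketch: the function name `recommend` and its boolean parameters are hypothetical, and the returned strings are taken from the outcomes listed in this article.

```python
def recommend(uses_extension_type: bool,
              uses_py_extension_type: bool = False,
              will_migrate: bool = False,
              trusts_provider: bool = False) -> str:
    """Walk the questions in order and return the matching outcome."""
    if not uses_extension_type:
        return "This issue does not impact you."
    if not uses_py_extension_type:
        return "This issue does not impact you."
    if will_migrate:
        return "Option 1: Code change"
    if trusts_provider:
        return "Option 2: Turn off the security update"
    return "Your system is in an UNSAFE state. This is not recommended."


if __name__ == "__main__":
    # Example: a workload that uses PyExtensionType and can be migrated.
    print(recommend(True, True, will_migrate=True))
```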