Problem
On Dec 5th, 2023, Databricks rolled out a security update to all supported Databricks Runtime versions to address a critical vulnerability (CVE-2023-47248) in the PyArrow Python package embedded in Databricks Runtime.
With this update, Databricks packages and automatically activates pyarrow-hotfix, the remediation recommended by the PyArrow community. The hotfix turns off the vulnerable PyArrow feature, which Databricks Runtime does not use by default. However, the change could break workloads that are customized to use that feature.
This security update applies only to the PyArrow package embedded in Databricks Runtime. It does not apply if you installed your own version of PyArrow on a Databricks cluster.
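If you do run your own PyArrow, you can apply the same remediation manually. The sketch below assumes the pyarrow-hotfix package from PyPI has been installed on the cluster; per that package's documentation, importing the module activates the fix.

# Manual remediation for a self-installed PyArrow (hypothetical setup,
# assuming pyarrow-hotfix was installed first, e.g. %pip install pyarrow-hotfix).
import pyarrow_hotfix  # importing the module activates the hotfix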
Impact
- There is no impact if your workloads don’t use the PyArrow extension datatype pyarrow.PyExtensionType.
- There is no impact if your workloads use the secure PyArrow extension datatype, pyarrow.ExtensionType.
- If your workloads use the vulnerable PyArrow extension datatype pyarrow.PyExtensionType, they will fail with the following error message (a sketch of the vulnerable pattern follows the message):
Found disallowed extension datatype (arrow.py_extension_type), please check
https://kb.databricks.com/pyarrow-hotfix-breaking-change for help.
Original error message from pyarrow-hotfix:
Disallowed deserialization of 'arrow.py_extension_type':
......
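For reference, a workload typically hits this error when it defines or consumes an extension type built on the vulnerable API. A minimal, hypothetical example of the vulnerable pattern (the LegacyUuidType name is illustrative):

import pyarrow as pa

# Vulnerable pattern: subclassing pyarrow.PyExtensionType, which relies
# on pickle for (de)serialization and is disabled by pyarrow-hotfix.
class LegacyUuidType(pa.PyExtensionType):
    def __init__(self):
        super().__init__(pa.binary(16))

    def __reduce__(self):
        # Pickle-based reconstruction is what CVE-2023-47248 exploits.
        return LegacyUuidType, ()

Reading Arrow or Parquet data that embeds such a type triggers the disallowed-deserialization error once the hotfix is active.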
Solution
You can fix the workload by completing one of the following actions.
Databricks recommends performing a code change to remediate the vulnerability.
Option 1: Code change (recommended)
If you use pyarrow.PyExtensionType to process Parquet files, migrate your Parquet files and data processing code to the secure API, pyarrow.ExtensionType. This is the long-term solution, and it requires changes to the code or process that produces the Parquet files.
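For example, the vulnerable pattern shown earlier could be rewritten on the secure API as follows. This is a minimal sketch; the UuidType name and the my_package.uuid extension name are hypothetical:

import pyarrow as pa

# Secure pattern: pyarrow.ExtensionType serializes its parameters
# explicitly instead of relying on pickle.
class UuidType(pa.ExtensionType):
    def __init__(self):
        # The extension name is stored with the data and used for lookup
        # at read time; no arbitrary code is deserialized.
        super().__init__(pa.binary(16), "my_package.uuid")

    def __arrow_ext_serialize__(self):
        # This type has no parameters, so nothing to serialize.
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # Reconstruct the type explicitly, without pickle.
        return UuidType()

# Register the type so files containing it can be read back.
pa.register_extension_type(UuidType())

Because the secure API identifies a type by its registered extension name rather than by a pickled class, the producer and all consumers of the Parquet files must be updated together.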
Option 2: Turn off the security update
If you cannot make the code changes and you trust the provider of the Parquet files, you can turn off the security update at your own risk. This temporary remediation is available on Databricks Runtime versions 14.2 and below.
Turn off the automatic activation of the security update by setting the following environment variable (AWS | Azure | GCP) in the configuration of all affected clusters.
DATABRICKS_DISABLE_AUTO_PYARROW_HOTFIX=True
Turning off the automatic activation re-enables PyArrow’s ability to work with Parquet files that use the vulnerable API, pyarrow.PyExtensionType.
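As a quick sanity check (assuming you have notebook access to an affected cluster), you can confirm that the variable is visible to the Python process before relying on the vulnerable code path:

import os

# Should print "True" on a cluster where the variable was set.
print(os.environ.get("DATABRICKS_DISABLE_AUTO_PYARROW_HOTFIX"))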
Help me choose
Do you use an extension datatype in PyArrow in your workload?
- No: This issue does not impact you.
- Yes: Do you use pyarrow.PyExtensionType (insecure)?
  - No: This issue does not impact you.
  - Yes: Are you comfortable migrating to pyarrow.ExtensionType (secure)?
    - Yes: You should follow the steps in Option 1: Code change.
    - No: Do you trust the Parquet file provider?
      - Yes: You should follow the steps in Option 2: Turn off the security update.
      - No: Your system is in an UNSAFE state. This is not recommended.