Problem
You are trying to initialize an Apache Spark context in Databricks Apps using getActiveSession()
or builder.appName("APP").getOrCreate()
to query or pull data from Databricks tables, but you keep getting Spark context or Java gateway error messages.
File "/app/python/source_code/.venv/lib/python3.11/site-packages/pyspark/sql/session.py", line 477, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/python/source_code/.venv/lib/python3.11/site-packages/pyspark/context.py", line 512, in getOrCreate
    SparkContext(conf=conf or SparkConf())
File "/app/python/source_code/.venv/lib/python3.11/site-packages/pyspark/context.py", line 198, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/app/python/source_code/.venv/lib/python3.11/site-packages/pyspark/context.py", line 432, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
                            ^^^^^^^^^^^^^^^^^^^^
File "/app/python/source_code/.venv/lib/python3.11/site-packages/pyspark/java_gateway.py", line 106, in launch_gateway
    raise RuntimeError("Java gateway process exited before sending its port number")
Cause
Databricks Apps do not support Spark contexts or JVM-based operations. Any attempt to initialize a Spark context within an App fails with the "Java gateway process exited before sending its port number" error. This is expected behavior.
Solution
Instead of using a Spark context, use the Databricks SQL Connector for Python (AWS | Azure | GCP). This connector lets you run SQL queries against Databricks tables through Databricks clusters and SQL warehouses.
For more information, you can also review the Databricks SQL Connector for Python documentation on PyPI.
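If the connector is not already available in your App's environment, install it from PyPI (the package is published as databricks-sql-connector):

```shell
# Install the Databricks SQL Connector for Python into the App's virtual environment
pip install databricks-sql-connector
```

In a Databricks App, you would typically pin this in the App's requirements.txt instead of installing it ad hoc.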
Example code
You must set the following variables before using this code snippet.
- <databricks-host> is the Server Hostname for your cluster or SQL warehouse.
- <http-path> is the HTTP Path value for your cluster or SQL warehouse.
- <access-token> is your Databricks personal access token.
Info
You can find the hostname and path values for a cluster in the JDBC/ODBC tab in the Advanced options on a cluster’s configuration page. You can find the same values for a SQL warehouse in the Connection details tab on the warehouse’s configuration page.
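Rather than hardcoding these values, you can read them from environment variables before connecting. This is a minimal sketch; the variable names DATABRICKS_HOST, DATABRICKS_HTTP_PATH, and DATABRICKS_TOKEN are illustrative choices, not names the connector requires:

```python
import os

# Illustrative environment variable names; set them to match your deployment.
server_hostname = os.getenv("DATABRICKS_HOST", "")
http_path = os.getenv("DATABRICKS_HTTP_PATH", "")
access_token = os.getenv("DATABRICKS_TOKEN", "")
```

You can then pass these variables to sql.connect() instead of the literal placeholders used below.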
%python
from databricks import sql

connection = sql.connect(
  server_hostname = "<databricks-host>",
  http_path = "<http-path>",
  access_token = "<access-token>"
)

cursor = connection.cursor()
cursor.execute("SELECT * FROM my_table LIMIT 10")
results = cursor.fetchall()

for row in results:
  print(row)

cursor.close()
connection.close()
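The connector's connections and cursors also work as Python context managers, which close both automatically even if the query raises an exception. A sketch of the same query using with blocks, with the same placeholder values as above:

```python
from databricks import sql

# Replace the placeholders with your workspace values, as described earlier.
with sql.connect(
    server_hostname="<databricks-host>",
    http_path="<http-path>",
    access_token="<access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM my_table LIMIT 10")
        for row in cursor.fetchall():
            print(row)
# Cursor and connection are both closed when the with blocks exit.
```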