You may want to access your tables outside of Databricks notebooks. In addition to connecting BI tools via JDBC (AWS | Azure), you can also access tables from Python scripts. You can connect to a Spark cluster via JDBC using PyHive and then run a script. PyHive must be installed on the machine where you run the Python script.
Install PyHive and Thrift
Use pip to install PyHive and Thrift.
%sh

pip install pyhive thrift
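The %sh cell above is for running the install from a notebook; if you run the script from your own machine, install the packages there with a plain pip install pyhive thrift instead. As a quick sanity check before connecting, a minimal sketch like the following (an addition for illustration, assuming Python 3) confirms that both packages import cleanly:

#!/usr/bin/python
# Sanity-check sketch (not part of the original article): verify that PyHive
# and Thrift are importable on the machine that will run the connection script.
from pyhive import hive                    # PyHive Hive interface
from thrift.transport import THttpClient   # Thrift HTTP transport used by the script below

print("PyHive and Thrift are available")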
Run SQL script
This sample Python script sends the SQL query show tables to your cluster and then displays the result of the query.
Do the following before you run the script:
- Replace <token> with your Databricks API token.
- Replace <databricks-instance> with the domain name of your Databricks deployment.
- Replace <workspace-id> with your workspace ID.
- Replace <cluster-id> with your cluster ID.
To get the API token, see Generate a token (AWS | Azure). To determine the other values, see How to get Workspace, Cluster, Notebook, and Job Details (AWS | Azure).
%python

#!/usr/bin/python

import sys
import base64

from pyhive import hive
from thrift.transport import THttpClient

# Fill in these placeholders before running the script.
TOKEN = "<token>"
WORKSPACE_URL = "<databricks-instance>"
WORKSPACE_ID = "<workspace-id>"
CLUSTER_ID = "<cluster-id>"

# Build the HTTPS endpoint of the cluster's Thrift server.
conn = 'https://%s/sql/protocolv1/o/%s/%s' % (WORKSPACE_URL, WORKSPACE_ID, CLUSTER_ID)
print(conn)

transport = THttpClient.THttpClient(conn)

# Pass the token as a base64-encoded Basic Authorization header.
auth = "token:%s" % TOKEN

PY_MAJOR = sys.version_info[0]

if PY_MAJOR < 3:
    auth = base64.standard_b64encode(auth)
else:
    auth = base64.standard_b64encode(auth.encode()).decode()

transport.setCustomHeaders({"Authorization": "Basic %s" % auth})

# Connect over the custom transport and submit the query asynchronously.
cursor = hive.connect(thrift_transport=transport).cursor()

cursor.execute('show tables', async_=True)

# Poll until the operation leaves a pending state.
pending_states = (
    hive.ttypes.TOperationState.INITIALIZED_STATE,
    hive.ttypes.TOperationState.PENDING_STATE,
    hive.ttypes.TOperationState.RUNNING_STATE)

while cursor.poll().operationState in pending_states:
    print("Pending...")

print("Done. Results:")

for table in cursor.fetchall():
    print(table)
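Once the connection works, the same cursor can run any SQL statement your cluster understands. The following is a sketch, not part of the original script: it runs a short query synchronously (no async_ polling is needed for quick statements) and then closes the cursor. The table name <table-name> is a placeholder; substitute one of the tables returned by show tables.

# Sketch only: <table-name> is a placeholder for a table listed by 'show tables'.
cursor.execute('SELECT * FROM <table-name> LIMIT 10')
for row in cursor.fetchall():
    print(row)

cursor.close()  # release the operation handle on the cluster when you are done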