How to run SQL queries from Python scripts

Learn how to run SQL queries using Python scripts.

Written by arjun.kaimaparambilrajan

Last published at: May 19th, 2022

You may want to access your tables outside of Databricks notebooks. Besides connecting BI tools via JDBC (AWS | Azure), you can also access tables by using Python scripts. You can connect to a Spark cluster via JDBC using PyHive and then run a script. You should have PyHive installed on the machine where you are running the Python script.

Delete

Info

Python 2 is considered end-of-life. You should use Python 3 to run the script provided in this article. If you have both Python 2 and Python 3 running on your system, you should make sure your version of pip is linked to Python 3 before you proceed.

You can check your version of pip by running pip -V at the command prompt. This command returns the version of pip and the version of Python it is using.

Install PyHive and Thrift

Use pip to install PyHive and Thrift.

%sh

pip install pyhive thrift

Run SQL script

This sample Python script sends the SQL query show tables to your cluster and then displays the result of the query.

Do the following before you run the script:

  1. Replace <token> with your Databricks API token.
  2. Replace <databricks-instance> with the domain name of your Databricks deployment.
  3. Replace <workspace-id> with the Workspace ID.
  4. Replace <cluster-id> with a cluster ID.

To get the API token, see Generate a token (AWS | Azure). To determine the other values, see How to get Workspace, Cluster, Notebook, and Job Details (AWS | Azure).

%python

#!/usr/bin/python


import os
import sys
from pyhive import hive
from thrift.transport import THttpClient
import base64


TOKEN = "<token>"
WORKSPACE_URL = "<databricks-instance>"
WORKSPACE_ID = "<workspace-id>"
CLUSTER_ID = "<cluster-id>"


conn = 'https://%s/sql/protocolv1/o/%s/%s' % (WORKSPACE_URL, WORKSPACE_ID, CLUSTER_ID)
print(conn)


transport = THttpClient.THttpClient(conn)


auth = "token:%s" % TOKEN
PY_MAJOR = sys.version_info[0]


if PY_MAJOR < 3:
  auth = base64.standard_b64encode(auth)
else:
  auth = base64.standard_b64encode(auth.encode()).decode()


transport.setCustomHeaders({"Authorization": "Basic %s" % auth})


cursor = hive.connect(thrift_transport=transport).cursor()


cursor.execute('show tables',async_=True)


pending_states = (
        hive.ttypes.TOperationState.INITIALIZED_STATE,
        hive.ttypes.TOperationState.PENDING_STATE,
        hive.ttypes.TOperationState.RUNNING_STATE)


while cursor.poll().operationState in pending_states:
    print("Pending...")


print("Done. Results:")


for table in cursor.fetchall():
    print(table)


Was this article helpful?