Apache Spark JDBC datasource query option doesn’t work for Oracle database

Learn how to resolve an error that occurs when using the Apache Spark JDBC datasource to connect to Oracle Database from Databricks.

Written by Adam Pavlacka

Last published at: June 1st, 2022


Problem

When you use the query option with the Apache Spark JDBC datasource to connect to an Oracle Database, it fails with this error:

java.sql.SQLSyntaxErrorException: ORA-00911: invalid character

For example, if you run the following to make a JDBC connection:

%scala

val df = spark.read
  .format("jdbc")
  .option("url", "<url>")
  .option("query", "SELECT * FROM oracle_test_table)")
  .option("user", "<user>")
  .option("password", "<password>")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()
df.show()

You will see this error Message:

java.sql.SQLSyntaxErrorException: ORA-00911: invalid character
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
      at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
      at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
      at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)

Cause

The error is due to a Spark-generated subquery alias (generated with the query option) that does not conform to Oracle Database identifier naming conventions. This bug is tracked in Spark Jira ticket SPARK-27596.

Solution

This issue is fixed in Apache Spark 2.4.4 and Databricks Runtime 5.4.

For clusters running on earlier versions of Spark or Databricks Runtime, use the dbtable option instead of the query option. The query must be enclosed in parentheses as a subquery.

%scala

val df = spark.read
  .format("jdbc")
  .option("url", "<url>")
  .option("dbtable", "(SELECT * FROM oracle_test_table)")
  .option("user", "<user>")
  .option("password", "<password>")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()
df.show()

You can try the same workaround for other databases when the query option fails.