Create tables on JSON datasets
This article covers how to create a table on JSON datasets using a SerDe. Download the JSON SerDe JAR Open the hive-json-serde 1.3.8 download page. Click json-serde-1.3.8-jar-with-dependencies.jar to download the file. Info You can review the Hive-JSON-Serde GitHub repo for more information on the JAR...
Delete table when underlying S3 bucket is deleted
Problem You get an error when you try to drop or alter a table. Error in SQL statement: IOException: Bucket_name … does not exist You can reproduce the error with a DROP TABLE or ALTER TABLE command. %sql DROP TABLE <database-name.table-name>; %sql ALTER TABLE <database-name.table-name> SET LOCATION "<file-system-location>";...
Failure when mounting or accessing Azure Blob storage
Problem When you try to access an already created mount point or create a new mount point, it fails with the error: WASB: Fails with java.lang.NullPointerException Cause This error can occur when the root mount path (such as /mnt/) is also mounted to blob storage. Run the following command to check if the root path is also mounted: %python dbutils.f...
Unable to read files and list directories in a WASB filesystem
Problem When you try reading a file on WASB with Spark, you get the following exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 19, 10.139.64.5, executor 0): shaded.databricks.org.apache.hadoop.fs.azure.AzureException: com.microsoft.a...
Optimize read performance from JDBC data sources
Problem Reading data from an external JDBC database is slow. How can I improve read performance? Solution See the detailed discussion in the Databricks documentation on how to optimize performance when reading data (AWS | Azure | GCP) from an external JDBC database....
Troubleshooting JDBC/ODBC access to Azure Data Lake Storage Gen2
Problem Info In general, you should use Databricks Runtime 5.2 or above, which includes a built-in Azure Blob File System (ABFS) driver, when you want to access Azure Data Lake Storage Gen2 (ADLS Gen2). This article applies to users who are accessing ADLS Gen2 storage using JDBC/ODBC instead. When you run a SQL query from a JDBC or ODBC client to ac...
CosmosDB-Spark connector library conflict
This article explains how to resolve an issue running applications that use the CosmosDB-Spark connector in the Databricks environment. Problem Normally if you add a Maven dependency to your Spark cluster, your app should be able to use the required connector libraries. But currently, if you simply specify the CosmosDB-Spark connector’s Maven co-ord...
Failure to detect encoding in JSON
Problem Spark job fails with an exception containing the message: Invalid UTF-32 character 0x1414141(above 10ffff) at char #1, byte #7) At org.apache.spark.sql.catalyst.json.JacksonParser.parse Cause The JSON data source reader is able to automatically detect encoding of input JSON files using BOM at the beginning of the files. However, BOM is not ...
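To make the idea of BOM-based detection concrete, here is a minimal sketch in plain Python. It is not Spark's implementation; the function name `detect_bom_encoding` and the lookup table are assumptions made for illustration. It only shows that a reader can infer an encoding when a byte order mark is present and has nothing to go on when it is absent.

```python
import codecs

# Hypothetical lookup table for illustration only.
# UTF-32 BOMs are listed first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom_encoding(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None  # No BOM: the encoding cannot be inferred this way.
```

When the function returns `None`, the caller has to be told the encoding explicitly rather than rely on detection.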
Inconsistent timestamp results with JDBC applications
Problem When using JDBC applications with Databricks clusters, you see inconsistent java.sql.Timestamp results when switching between standard time and daylight saving time. Cause Databricks clusters use UTC by default. java.sql.Timestamp uses the JVM’s local time zone. If a Databricks cluster returns 2021-07-12 21:43:08 as a string, the JVM parses i...
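The effect of interpreting the same timestamp string in two different time zones can be sketched in plain Python. This is only an illustration of the mismatch, not the JDBC driver's behavior: the helper `parse_in_zone` and the fixed UTC-7 client offset are hypothetical, chosen to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

RAW = "2021-07-12 21:43:08"  # timestamp string from the example above


def parse_in_zone(raw: str, tz: timezone) -> datetime:
    """Parse a zone-less timestamp string and attach the given time zone."""
    naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz)


# The cluster means the string as UTC.
cluster_view = parse_in_zone(RAW, timezone.utc)

# Hypothetical client whose local zone is UTC-7 (an assumption for illustration).
client_zone = timezone(timedelta(hours=-7))
client_view = parse_in_zone(RAW, client_zone)

# Same text, two different instants: the client is off by its UTC offset.
skew = client_view - cluster_view
```

Here `skew` equals the client's offset from UTC. A client zone that observes daylight saving time changes that offset twice a year, so the size of the discrepancy changes with it.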
Kafka client terminated with OffsetOutOfRangeException
Problem You have an Apache Spark application that is trying to fetch messages from an Apache Kafka source when it is terminated with a kafkashaded.org.apache.kafka.clients.consumer.OffsetOutOfRangeException error message. Cause Your Spark application is trying to fetch expired data offsets from Kafka. We generally see this in these two scenarios: Sc...
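As a rough illustration of what an expired offset means, the check below compares a requested offset against the range a partition still holds. It is a plain-Python sketch with hypothetical names (`offset_is_fetchable`, `earliest`, `log_end`); it does not call the Kafka client and is not how Spark or Kafka implement the check.

```python
def offset_is_fetchable(requested: int, earliest: int, log_end: int) -> bool:
    """Return True if the requested offset lies within the retained range.

    earliest: oldest offset still retained in the partition (assumed known).
    log_end:  offset that the next written message will receive (assumed known).
    """
    return earliest <= requested <= log_end


# Retention has removed everything below offset 500 in this made-up partition.
stale_request_ok = offset_is_fetchable(requested=120, earliest=500, log_end=900)
fresh_request_ok = offset_is_fetchable(requested=650, earliest=500, log_end=900)
```

A request like the first one, which asks for an offset older than the earliest retained one, corresponds to the expired-offset situation described above.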
Apache Spark JDBC datasource query option doesn’t work for Oracle database
Problem When you use the query option with the Apache Spark JDBC datasource to connect to an Oracle Database, it fails with this error: java.sql.SQLSyntaxErrorException: ORA-00911: invalid character For example, if you run the following to make a JDBC connection: %scala val df = spark.read .format("jdbc") .option("url", "<url>") .option(...
Accessing Redshift fails with NullPointerException
Problem Sometimes when you read a Redshift table: %scala val original_df = spark.read. format("com.databricks.spark.redshift"). option("url", url). option("user", user). option("password", password). option("query", query). option("forward_spark_s3_credentials", true). option("tempdir", "path"). load()...
Redshift JDBC driver conflict issue
Problem If you attach multiple Redshift JDBC drivers to a cluster, and use the Redshift connector, the notebook REPL might hang or crash with a SQLDriverWrapper error message. 19/11/14 01:01:44 ERROR SQLDriverWrapper: Fatal non-user error thrown in ReplId-9d455-9b970-b2042 java.lang.NoSuchFieldError: PG_SUBPROTOCOL_NAMES at com.amazon.redshi...
ABFS client hangs if incorrect client ID or wrong path used
Problem You are using Azure Data Lake Storage (ADLS) Gen2. When you try to access an Azure Blob File System (ABFS) path from a Databricks cluster, the command hangs. Enable the debug log and you can see the following stack trace in the driver logs: Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: https://login.microso...