How to Dump Tables in CSV, JSON, XML, Text, or HTML Format

Sometimes you need to make results computed in Databricks available outside Databricks. You can connect BI tools to your cluster via JDBC and export results from those tools, or save your tables to DBFS or blob storage and copy the data via the REST API.

This topic introduces JSpark, a simple console tool that runs SQL queries over JDBC against Spark clusters and dumps remote tables to local disk in CSV, JSON, XML, text, or HTML format. For example:

java -Dconfig.file=mycluster.conf -jar jspark.jar -q "select id, type, priority, status from tickets limit 5"

returns:

+----+--------+--------+------+
|  id|type    |priority|status|
+----+--------+--------+------+
|9120|problem |urgent  |closed|
|9121|question|normal  |hold  |
|9122|incident|normal  |closed|
|9123|question|normal  |open  |
|9124|incident|normal  |solved|
+----+--------+--------+------+

Instructions for use, example usage, source code, and a link to the assembled JAR are available at the JSpark GitHub repo.

You can specify the JDBC connection parameters either as command-line arguments or in a config file, for example: mycluster.conf.
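The -Dconfig.file flag in the example above suggests JSpark reads a Typesafe Config (HOCON) file. The exact key names are defined by JSpark itself, so the sketch below is only a hypothetical illustration of what mycluster.conf might contain; check the JSpark GitHub repo for the real schema:

```
# Hypothetical mycluster.conf sketch -- key names are assumptions,
# not taken from the JSpark docs. Values are placeholders.
jdbc {
  url = "jdbc:hive2://<your-cluster-host>:443/default;transportMode=http;ssl=true"
  # "token" as the user plus a personal access token as the password
  # is the usual Databricks JDBC convention.
  user = "token"
  password = "<your-access-token>"
}
```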

To check or troubleshoot JDBC connections, download the fat JAR jspark.jar and launch it as a regular JAR. It includes hive-jdbc 1.2.1 and all required dependencies.
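If you want to verify connectivity without going through JSpark's CLI, a few lines of plain JDBC are enough. The sketch below uses only java.sql from the standard library; the host, HTTP path, and token are placeholder assumptions, and hive-jdbc with its dependencies (for example, the jspark.jar fat JAR) must be on the classpath for the hive2 driver to be found:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcCheck {
    // Builds a hive2 JDBC URL of the general shape Databricks clusters use.
    // The host and httpPath passed in below are placeholders, not real endpoints.
    static String buildUrl(String host, int port, String httpPath) {
        return "jdbc:hive2://" + host + ":" + port + "/default"
                + ";transportMode=http;ssl=true;httpPath=" + httpPath;
    }

    public static void main(String[] args) {
        String url = buildUrl("example.cloud.databricks.com", 443,
                "sql/protocolv1/o/0/<cluster-id>");
        System.out.println("Connecting to: " + url);
        // "token" as user plus a personal access token as password is the
        // usual Databricks JDBC convention; adjust to your setup.
        try (Connection conn = DriverManager.getConnection(url, "token", "<access-token>");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT 1")) {
            rs.next();
            System.out.println("Connection OK, SELECT 1 returned " + rs.getInt(1));
        } catch (SQLException e) {
            // Without hive-jdbc on the classpath this reports "No suitable driver".
            System.out.println("JDBC check failed: " + e.getMessage());
        }
    }
}
```

Running this with jspark.jar on the classpath exercises the same driver stack JSpark uses, so a failure here points at the connection settings rather than the tool.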