How to dump tables in CSV, JSON, XML, text, or HTML format

Learn how to output tables from Databricks in CSV, JSON, XML, text, or HTML format.

Written by Adam Pavlacka

Last published at: May 25th, 2022

Sometimes you need to export the results of computations in Databricks to an external system. You can use BI tools to connect to your cluster over JDBC and export the results from the BI tools, or you can save your tables in DBFS or blob storage and copy the data via the REST API.
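For the DBFS route, a minimal sketch using the Databricks REST API could look like the following. The `dbfs/read` endpoint returns file contents base64-encoded, up to 1 MB per call; the host, token, and path here are placeholders, and the helper function name is our own:

```shell
# Sketch: fetch a result file saved to DBFS via the Databricks REST API.
# DATABRICKS_HOST and DATABRICKS_TOKEN are placeholders you must set;
# fetch_dbfs_file is a hypothetical helper, not part of any Databricks tool.
fetch_dbfs_file() {
  # dbfs/read returns JSON with the file contents base64-encoded in "data";
  # a single call reads at most 1 MB (1048576 bytes).
  curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
    "https://$DATABRICKS_HOST/api/2.0/dbfs/read?path=$1&offset=0&length=1048576" \
    | python3 -c 'import sys, json, base64; print(base64.b64decode(json.load(sys.stdin)["data"]).decode())'
}

# Example (requires a live workspace):
# fetch_dbfs_file /tmp/results.csv
```

For files larger than 1 MB you would loop, advancing `offset` until the response's `bytes_read` drops below the requested length.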

This article introduces JSpark, a simple console tool that executes SQL queries over JDBC on Spark clusters and dumps remote tables to local disk in CSV, JSON, XML, text, or HTML format.

For example:

%sh

java -Dconfig.file=mycluster.conf -jar jspark.jar -q "select id, type, priority, status from tickets limit 5"

returns:

+----+--------+--------+------+
|  id|type    |priority|status|
+----+--------+--------+------+
|9120|problem |urgent  |closed|
|9121|question|normal  |hold  |
|9122|incident|normal  |closed|
|9123|question|normal  |open  |
|9124|incident|normal  |solved|
+----+--------+--------+------+

Instructions for use, example usage, source code, and a link to the assembled JAR are available at the JSpark GitHub repo.

You can specify the JDBC connection parameters either as command-line arguments or in a config file, for example mycluster.conf.
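The exact configuration keys depend on the JSpark version, so treat this as a sketch rather than a verbatim template: a mycluster.conf in Typesafe Config (HOCON) style, using a standard Databricks Hive JDBC URL with placeholder host, HTTP path, and token, might look like:

```
// Hypothetical key names -- check the JSpark README for the actual ones.
jdbc {
  // Standard Databricks cluster JDBC URL; replace the placeholders.
  url = "jdbc:hive2://<databricks-instance>:443/default;ssl=true;transportMode=http;httpPath=<http-path>"
  user = "token"                       // "token" enables personal-access-token auth
  password = "<personal-access-token>"
}
```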

To check or troubleshoot JDBC connections, download the fat JAR (jspark.jar) and launch it as a regular JAR. It bundles hive-jdbc 1.2.1 and all required dependencies.