Replace a default library jar
Databricks includes a number of default Java and Scala libraries. You can replace any of these libraries with another version by using a cluster-scoped init script to remove the default library jar and then install the version you require. Warning: Removing default libraries and installing new versions may cause instability or completely break your D...
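The excerpt above is truncated, but as a rough illustration of the approach it describes, here is a minimal sketch of writing such an init script from a notebook. The script name, jar names, and DBFS paths are all hypothetical; substitute the library you actually need to replace.

%scala
// Minimal sketch: write a cluster-scoped init script to DBFS (hypothetical names and paths).
// The script deletes the default jar from /databricks/jars and copies in a replacement
// jar that was previously uploaded to DBFS.
dbutils.fs.put("dbfs:/databricks/init-scripts/replace-library.sh",
  """#!/bin/bash
    |rm -f /databricks/jars/old-library.jar
    |cp /dbfs/FileStore/jars/new-library.jar /databricks/jars/
    |""".stripMargin, true)

Attach the script to the cluster as a cluster-scoped init script and restart the cluster for the change to take effect.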
How to specify the DBFS path
When working with Databricks you will sometimes have to access the Databricks File System (DBFS). Accessing files on DBFS is done with standard filesystem commands; however, the syntax varies depending on the language or tool used. For example, take the following DBFS path: dbfs:/mnt/test_folder/test_folder1/ Apache Spark: Under Spark, you should spec...
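As a short sketch of the differences the article goes on to describe, here is the same example path used from a few common entry points. The assumption that the folder contains Parquet files is ours, purely for illustration.

%scala
// Apache Spark: reference the path with the dbfs:/ scheme (or the bare /mnt/... form).
val df = spark.read.format("parquet").load("dbfs:/mnt/test_folder/test_folder1/")

// Databricks Utilities (dbutils.fs) and the %fs magic accept the same dbfs:/ style path.
display(dbutils.fs.ls("dbfs:/mnt/test_folder/test_folder1/"))

// Local file APIs and %sh see DBFS through the /dbfs mount point:
//   /dbfs/mnt/test_folder/test_folder1/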
Generate unique increasing numeric values
This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three different methods; select the one that works best for your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD): The zipWithIndex() function is only available within RDDs. You cannot...
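As a rough sketch of the zipWithIndex() method named above (the example DataFrame is hypothetical):

%scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")   // hypothetical example data

// zipWithIndex() exists only on RDDs, so convert the DataFrame to an RDD, attach the
// index to each row, and rebuild a DataFrame with an extra LongType column.
val rddWithIndex = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val schemaWithIndex = df.schema.add(StructField("index", LongType, nullable = false))
val dfWithIndex = spark.createDataFrame(rddWithIndex, schemaWithIndex)
dfWithIndex.show()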
Create tables on JSON datasets
In this article we cover how to create a table on JSON datasets using SerDe. Download the JSON SerDe JAR: Open the hive-json-serde 1.3.8 download page and click json-serde-1.3.8-jar-with-dependencies.jar to download the file. Info: You can review the Hive-JSON-Serde GitHub repo for more information on the JAR...
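The excerpt is cut off before the table definition, but as a minimal sketch, assuming the SerDe JAR has already been installed on the cluster and that the JSON files sit at a hypothetical DBFS location (the table name, column names, and types are also hypothetical):

%scala
// Create a table over JSON files using the SerDe class from the Hive-JSON-Serde project.
// The table name, columns, and LOCATION path are placeholder values.
spark.sql("""
  CREATE TABLE json_table_example (
    json_col1 STRING,
    json_col2 INT
  )
  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  LOCATION 'dbfs:/mnt/test_folder/json_data/'
""")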
Create a DataFrame from a JSON string or Python dictionary
In this article we are going to review how you can create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary. Create a Spark DataFrame from a JSON string: Add the JSON content from the variable to a list.
%scala
import scala.collection.mutable.ListBuffer
val json_content1 = "{'json_col1': 'hello', 'json_col2': 32...
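The code in the excerpt is cut off; as a sketch that completes the same approach (the second JSON record is hypothetical):

%scala
import scala.collection.mutable.ListBuffer
import spark.implicits._

val json_content1 = "{'json_col1': 'hello', 'json_col2': 32}"
val json_content2 = "{'json_col1': 'hello', 'json_col2': 54}"   // hypothetical second record

// Collect the JSON strings in a list, then let spark.read.json parse them
// (it accepts a Dataset[String] with one JSON document per element).
val json_seq = new ListBuffer[String]()
json_seq += json_content1
json_seq += json_content2

val json_ds = json_seq.toSeq.toDS()
val df = spark.read.json(json_ds)
df.show()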
Best practice for cache(), count(), and take()
cache() is an Apache Spark transformation that can be used on a DataFrame, Dataset, or RDD when you want to perform more than one action. cache() caches the specified DataFrame, Dataset, or RDD in the memory of your cluster’s workers. Since cache() is a transformation, the caching operation takes place only when a Spark action (for example, count(),...
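As a minimal sketch of the pattern the article describes (the source path is hypothetical):

%scala
// Hypothetical source; any DataFrame you intend to reuse across actions works the same way.
val df = spark.read.parquet("dbfs:/mnt/test_folder/test_folder1/")

df.cache()    // lazy: marks the DataFrame for caching, nothing is computed yet
df.count()    // an action: scans the full DataFrame and materializes it in the workers' memory
df.take(10)   // later actions can now be served from the cached data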