How to set up Apache Kafka on Databricks
This article explains how to set up Apache Kafka on AWS EC2 instances and connect them to Databricks. The following are the high-level steps required to create a Kafka cluster and connect to it from Databricks notebooks.
Step 1: Create a new VPC in AWS
When creating the new VPC, set its CIDR range so that it does not overlap with the Databricks VPC CIDR range. For example:
Databricks VPC vpc-7f4c0d18 has CIDR IP range 10.205.0.0/16.
New VPC vpc-8eb1faf7 has CIDR IP range 10.10.0.0/16.
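If you use the AWS CLI instead of the console, a minimal sketch of this step looks like the following; the subnet CIDR is an assumption for illustration, and the IDs your account returns will differ from the example IDs above.
# Create the Kafka VPC with a CIDR range that does not overlap the Databricks VPC.
aws ec2 create-vpc --cidr-block 10.10.0.0/16
# Create a subnet for the Kafka EC2 instance inside the new VPC (example subnet CIDR).
aws ec2 create-subnet --vpc-id vpc-8eb1faf7 --cidr-block 10.10.0.0/24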
Create a new internet gateway, attach it to the new VPC, and add a route to the gateway in the VPC's route table. This allows you to SSH into the EC2 machines that you launch under this VPC.
Create a new internet gateway.
Attach it to VPC vpc-8eb1faf7.
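As a sketch, the same substeps with the AWS CLI look like the following; the igw- and rtb- IDs are placeholders for the values your account returns.
# Create the internet gateway and attach it to the new VPC.
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id igw-xxxxxxxx --vpc-id vpc-8eb1faf7
# Add a default route through the gateway so SSH traffic can reach the EC2 machines.
aws ec2 create-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx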
Step 2: Launch the EC2 instance in the new VPC
Launch the EC2 instance inside the new VPC vpc-8eb1faf7 created in Step 1.
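If you launch from the AWS CLI rather than the console, a sketch looks like this; the AMI ID, instance type, and subnet ID are assumptions, and the key-pair name must match the one you use for SSH in Step 3.
# Launch one instance in a subnet of the new VPC, with a public IP for SSH access.
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m4.large --key-name keypair --subnet-id subnet-xxxxxxxx --associate-public-ip-address --count 1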

Step 3: Install Kafka and ZooKeeper on the new EC2 instance
SSH into the machine with the key pair.
ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
Download Kafka and extract the archive.
wget https://apache.claz.org/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz
tar -zxf kafka_2.12-0.10.2.1.tgz
Start the ZooKeeper process.
cd kafka_2.12-0.10.2.1
bin/zookeeper-server-start.sh config/zookeeper.properties
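Optionally, confirm ZooKeeper is listening before starting the broker. ruok is one of ZooKeeper's built-in four-letter commands; a healthy server answers imok.
# Run from a second shell on the same machine.
echo ruok | nc localhost 2181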
Edit the config/server.properties file and set advertised.listeners to the private IP of the EC2 node (10.10.143.166 in this example):
advertised.listeners=PLAINTEXT://10.10.143.166:9092
Start the Kafka broker.
cd kafka_2.12-0.10.2.1
bin/kafka-server-start.sh config/server.properties
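To confirm the broker came up cleanly, you can list topics; on Kafka 0.10.x the topic tooling talks to ZooKeeper rather than the broker, and an empty result is normal on a fresh install.
# Errors here usually mean the broker or ZooKeeper is not running.
bin/kafka-topics.sh --list --zookeeper localhost:2181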
Step 4: Peer two VPCs
Create a new peering connection.
Add the peering connection to the route tables of your Databricks VPC and the new Kafka VPC created in Step 1.
In the Kafka VPC, go to the route table and add the route to the Databricks VPC.
In the Databricks VPC, go to the route table and add the route to the Kafka VPC.
For more information, see VPC Peering.
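Scripted with the AWS CLI, the peering setup is roughly the following sketch; the pcx- and rtb- IDs are placeholders, and each create-route call targets the route table named in the comment above it.
# Request and accept a peering connection between the Kafka and Databricks VPCs.
aws ec2 create-vpc-peering-connection --vpc-id vpc-8eb1faf7 --peer-vpc-id vpc-7f4c0d18
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-xxxxxxxx
# In the Kafka VPC route table: route traffic for the Databricks CIDR over the peering link.
aws ec2 create-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 10.205.0.0/16 --vpc-peering-connection-id pcx-xxxxxxxx
# In the Databricks VPC route table: route traffic for the Kafka CIDR over the peering link.
aws ec2 create-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 10.10.0.0/16 --vpc-peering-connection-id pcx-xxxxxxxx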
Step 5: Access the Kafka broker from a notebook
Verify you can reach the EC2 instance running the Kafka broker with telnet.
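One way to run this check is from a %sh cell in a Databricks notebook, using the broker's private IP from Step 3. If the connection hangs, recheck the peering routes and the instance's security group.
%sh
telnet 10.10.143.166 9092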
Create a new topic in the Kafka broker.
SSH to the Kafka broker.
ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
Create the topic and load it with data from the command line; the console producer auto-creates the wordcount topic as it writes the contents of the LICENSE file.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordcount < LICENSE
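Alternatively, if you would rather not rely on automatic topic creation, you can create the topic explicitly before producing to it; with this Kafka version the command points at ZooKeeper.
# Create the wordcount topic with a single partition and replica.
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wordcount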
Read data in a notebook.
import org.apache.spark.sql.functions._

val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "10.10.143.166:9092")
  .option("subscribe", "wordcount")
  .option("startingOffsets", "earliest")
  .load()

display(kafka)
Example Kafka byte stream