How to set up Apache Kafka on Databricks

This article explains how to set up Apache Kafka on AWS EC2 instances and connect to it from Databricks. The following high-level steps create a Kafka cluster and connect to it from Databricks notebooks.

Step 1: Create a new VPC in AWS

  1. When creating the new VPC, choose a CIDR range that does not overlap with the Databricks VPC CIDR range, because VPC peering requires non-overlapping ranges. For example:

    • Databricks VPC vpc-7f4c0d18 has CIDR IP range

    • New VPC vpc-8eb1faf7 has CIDR IP range

  2. Create a new internet gateway, attach it to the new VPC, and make sure the VPC's route table has a route to the gateway. This allows you to SSH into the EC2 instances that you launch in this VPC.

    1. Create a new internet gateway.

    2. Attach it to VPC vpc-8eb1faf7.
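Before creating the new VPC, it can help to confirm that the two CIDR ranges really do not overlap. The following is a minimal sketch in POSIX shell, not part of the original procedure. The function names (ip_to_int, cidr_bounds, cidr_overlap) and the sample ranges 10.10.0.0/16 and 10.20.0.0/16 are hypothetical; substitute the ranges of your own VPCs.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print the first and last address (as integers) covered by a CIDR block.
cidr_bounds() {
  addr=${1%/*}
  prefix=${1#*/}
  ip=$(ip_to_int "$addr")
  size=$(( 1 << (32 - prefix) ))
  start=$(( ip / size * size ))   # align to the start of the block
  echo "$start $(( start + size - 1 ))"
}

# Print "overlap" if the two CIDR blocks share any address, else "disjoint".
cidr_overlap() {
  set -- $(cidr_bounds "$1") $(cidr_bounds "$2")
  if [ "$1" -le "$4" ] && [ "$3" -le "$2" ]; then
    echo overlap
  else
    echo disjoint
  fi
}

# Hypothetical example ranges, not taken from the article.
cidr_overlap 10.10.0.0/16 10.20.0.0/16   # prints "disjoint"
cidr_overlap 10.10.0.0/16 10.10.143.0/24 # prints "overlap"
```

A result of "disjoint" means the two ranges can be peered in Step 4; "overlap" means you should pick a different range for the new VPC.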


Step 2: Launch the EC2 instance in the new VPC

Launch the EC2 instance inside the new VPC vpc-8eb1faf7 created in Step 1.


Step 3: Install Kafka and ZooKeeper on the new EC2 instance

  1. SSH into the machine with the key pair.

    ssh -i keypair.pem

  2. Download Kafka and extract the archive.

    tar -zxf kafka_2.12-

  3. Start the ZooKeeper process.

    cd kafka_2.12-
    bin/ config/

  4. Edit the config/ file and set as the private IP of the EC2 node.

  5. Start the Kafka broker.

    cd kafka_2.12-
    bin/ config/

Step 4: Peer two VPCs

  1. Create a new peering connection.

  2. Add the peering connection to the route tables of both the Databricks VPC and the Kafka VPC created in Step 1.

    • In the Kafka VPC, go to the route table and add the route to the Databricks VPC.

    • In the Databricks VPC, go to the route table and add the route to the Kafka VPC.


For more information, see VPC Peering.
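The peering steps above can also be done with the AWS CLI instead of the console. The following sketch only prints the commands so that you can review them before running anything. The function name print_peering_commands, the route table IDs, the CIDR ranges, and the peering connection ID are hypothetical placeholders; only the two VPC IDs come from this article. The peering connection ID is returned by the first command, so it is shown as a placeholder unless you pass it as the seventh argument.

```shell
#!/bin/sh
# Print (do not run) the AWS CLI commands that peer two VPCs and add the routes.
# Arguments: Databricks VPC ID, Kafka VPC ID, Databricks route table ID,
#            Kafka route table ID, Databricks CIDR, Kafka CIDR,
#            optional peering connection ID (known only after creation).
print_peering_commands() {
  dbx_vpc=$1
  kafka_vpc=$2
  dbx_rtb=$3
  kafka_rtb=$4
  dbx_cidr=$5
  kafka_cidr=$6
  pcx=${7:-<peering-connection-id>}
  cat <<EOF
aws ec2 create-vpc-peering-connection --vpc-id $kafka_vpc --peer-vpc-id $dbx_vpc
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id $pcx
aws ec2 create-route --route-table-id $kafka_rtb --destination-cidr-block $dbx_cidr --vpc-peering-connection-id $pcx
aws ec2 create-route --route-table-id $dbx_rtb --destination-cidr-block $kafka_cidr --vpc-peering-connection-id $pcx
EOF
}

# Hypothetical route table IDs and CIDR ranges; replace them with your own.
print_peering_commands vpc-7f4c0d18 vpc-8eb1faf7 rtb-databricks rtb-kafka 10.10.0.0/16 10.20.0.0/16
```

The two create-route commands correspond to the two bullets above: the Kafka route table gets a route to the Databricks CIDR range, and the Databricks route table gets a route to the Kafka CIDR range.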

Step 5: Access the Kafka broker from a notebook

  1. Use telnet to verify that you can reach the EC2 instance running the Kafka broker.

  2. Create a new topic in the Kafka broker.

    1. SSH to the Kafka broker.

      ssh -i keypair.pem

    2. Create a topic from the command line.

      bin/ --broker-list localhost:9092 --topic wordcount < LICENSE

  3. Read data in a notebook.

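    import org.apache.spark.sql.functions._
    // Subscribe to the wordcount topic, reading from the earliest available offset.
    val kafka = spark.readStream
            .format("kafka")
            // Set this to the Kafka broker's private IP and port, as host:port.
            .option("kafka.bootstrap.servers", "")
            .option("subscribe", "wordcount")
            .option("startingOffsets", "earliest")
            .load()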

    Example Kafka byte stream
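
If telnet is not available for the reachability check at the start of Step 5, a plain TCP connection attempt gives the same answer. The following is a minimal sketch, assuming bash and the coreutils timeout command are installed on the machine running the check. The function name check_port and the variable KAFKA_BROKER_IP are hypothetical; 9092 is the broker port used earlier in this article.

```shell
#!/bin/sh
# Print "reachable" if a TCP connection to host:port succeeds within 5 seconds,
# otherwise print "unreachable". Uses bash's /dev/tcp pseudo-device.
check_port() {
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null; then
    echo reachable
  else
    echo unreachable
  fi
}

# Example call; KAFKA_BROKER_IP is a hypothetical variable holding the
# private IP of the EC2 instance that runs the Kafka broker.
# check_port "$KAFKA_BROKER_IP" 9092
```

If the result is "unreachable", recheck the peering routes from Step 4 and the security group of the Kafka instance before debugging the notebook code.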