How to set up Apache Kafka on Databricks

Learn how to set up Apache Kafka on Databricks.

Written by Adam Pavlacka

Last published at: May 18th, 2022

This article explains how to set up Apache Kafka on AWS EC2 machines and connect them with Databricks. The following are the high-level steps required to create a Kafka cluster and connect to it from Databricks notebooks.

Step 1: Create a new VPC in AWS

  1. When creating the new VPC, choose a CIDR range that does not overlap the Databricks VPC CIDR range. For example:

    • Databricks VPC vpc-7f4c0d18 has CIDR IP range 10.205.0.0/16.
    • New VPC vpc-8eb1faf7 has CIDR IP range 10.10.0.0/16.
  2. Create a new internet gateway and attach it to the route table of the new VPC. This allows you to SSH into the EC2 machines that you launch under this VPC. (A scripted version of this step follows the list.)

    • Create a new internet gateway.
    • Attach it to VPC vpc-8eb1faf7.
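If you prefer to script Step 1, the AWS CLI commands below are a minimal sketch of the same setup. The gateway and route table IDs (igw-0abc123, rtb-0abc123) are placeholders; use the IDs your account returns.

    # Create the Kafka VPC with a CIDR range that does not overlap the Databricks VPC
    aws ec2 create-vpc --cidr-block 10.10.0.0/16

    # Create an internet gateway and attach it to the new VPC
    aws ec2 create-internet-gateway
    aws ec2 attach-internet-gateway --internet-gateway-id igw-0abc123 --vpc-id vpc-8eb1faf7

    # Add a default route through the gateway so you can SSH into instances in this VPC
    aws ec2 create-route --route-table-id rtb-0abc123 --destination-cidr-block 0.0.0.0/0 --gateway-id igw-0abc123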

Step 2: Launch the EC2 instance in the new VPC

Launch the EC2 instance inside the new VPC vpc-8eb1faf7 created in Step 1.

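The console launch can also be scripted. The sketch below uses placeholder values: replace the AMI ID, key pair name, and subnet ID with values from your account, and make sure the subnet belongs to vpc-8eb1faf7.

    # Launch one instance in the new VPC with a public IP so it is reachable over SSH
    aws ec2 run-instances \
      --image-id ami-0123456789abcdef0 \
      --instance-type m4.large \
      --key-name keypair \
      --subnet-id subnet-0abc123 \
      --associate-public-ip-address \
      --count 1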

Step 3: Install Kafka and ZooKeeper on the new EC2 instance

  1. SSH into the machine with the key pair.
    ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
  2. Download Kafka and extract the archive.
    wget https://apache.claz.org/kafka/0.10.2.1/kafka_2.12-0.10.2.1.tgz
    tar -zxf kafka_2.12-0.10.2.1.tgz
  3. Start the ZooKeeper process.
    cd kafka_2.12-0.10.2.1
    bin/zookeeper-server-start.sh config/zookeeper.properties
  4. Edit the config/server.properties file and set advertised.listeners to the private IP of the EC2 node (10.10.143.166 in this example).
    advertised.listeners=PLAINTEXT://10.10.143.166:9092
  5. Start the Kafka broker.
    cd kafka_2.12-0.10.2.1
    bin/kafka-server-start.sh config/server.properties
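Both start scripts run in the foreground, so run each in its own terminal session (or detach them with nohup). Once both processes are up, you can sanity-check the broker from the same machine. The commands below are a sketch that assumes the default ZooKeeper port 2181; they create the wordcount topic used later in this article and confirm the broker registered it.

    cd kafka_2.12-0.10.2.1

    # Create the topic used in Step 5 (one partition, no replication on this single-node cluster)
    bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic wordcount --partitions 1 --replication-factor 1

    # List topics to confirm the topic exists
    bin/kafka-topics.sh --zookeeper localhost:2181 --list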

Step 4: Peer two VPCs

  1. Create a new peering connection.
  2. Add the peering connection to the route tables of your Databricks VPC and the new Kafka VPC created in Step 1.

    • In the Kafka VPC, go to the route table and add the route to the Databricks VPC.
    • In the Databricks VPC, go to the route table and add the route to the Kafka VPC.

For more information, see VPC Peering.
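The same peering setup can be scripted with the AWS CLI. In this sketch the peering connection ID (pcx-0abc123) and route table IDs (rtb-0abc123, rtb-0def456) are placeholders, and the CIDR ranges are the example ranges from Step 1.

    # Request and accept a peering connection between the Kafka VPC and the Databricks VPC
    aws ec2 create-vpc-peering-connection --vpc-id vpc-8eb1faf7 --peer-vpc-id vpc-7f4c0d18
    aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id pcx-0abc123

    # In the Kafka VPC route table, route Databricks-bound traffic through the peering connection
    aws ec2 create-route --route-table-id rtb-0abc123 --destination-cidr-block 10.205.0.0/16 --vpc-peering-connection-id pcx-0abc123

    # In the Databricks VPC route table, route Kafka-bound traffic the same way
    aws ec2 create-route --route-table-id rtb-0def456 --destination-cidr-block 10.10.0.0/16 --vpc-peering-connection-id pcx-0abc123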

Step 5: Access the Kafka broker from a notebook

  1. Verify you can reach the EC2 instance running the Kafka broker on port 9092 with telnet, for example telnet 10.10.143.166 9092 from a notebook %sh cell.
  2. SSH to the Kafka broker.
    %sh
    
    ssh -i keypair.pem ec2-user@ec2-xx-xxx-xx-xxx.us-west-2.compute.amazonaws.com
  3. Create a new topic in the Kafka broker and publish data to it from the command line. Piping the LICENSE file into the console producer auto-creates the wordcount topic on first write (if it was not already created in Step 3).
    %sh
    
    bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordcount < LICENSE
  4. Read data in a notebook.
    %scala
    
    import org.apache.spark.sql.functions._
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "10.10.143.166:9092")
      .option("subscribe", "wordcount")
      .option("startingOffsets", "earliest")
      .load()
    display(kafka)
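The Kafka source exposes key and value as binary columns. As a minimal follow-up sketch (assuming the kafka streaming DataFrame loaded above), you can cast both to strings so the LICENSE lines published in the previous step are readable:

    %scala
    
    // key and value arrive as binary; cast both to strings for display
    val messages = kafka.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    display(messages)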