Problem
When you try to connect to the SFTP server from Databricks using passwordless (key-based) authentication, you receive the following error.
Error message: Host key verification failed
Cause
Cluster restarts delete the data stored on the cluster's local disk, including the SSH keys under ~/.ssh. Because the private key is not preserved across restarts, subsequent connection attempts fail host key verification.
Solution
Create an RSA authentication key pair for accessing the remote site from your Databricks account, and preserve the private key so it survives cluster restarts.
Generate an SSH key-pair
First, create an SSH key pair, either inside or outside Databricks. To create the RSA key pair in Databricks, run the following command in a Databricks notebook.
%sh
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
This command creates an RSA key pair without a passphrase at the following location.
- Private key: ~/.ssh/id_rsa
- Public key: ~/.ssh/id_rsa.pub
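To confirm a key pair is valid, you can print its fingerprint with ssh-keygen -lf. The sketch below does this against a throwaway pair in a temporary directory, so it does not touch the keys generated above:

```shell
# Sanity check: generate a throwaway RSA pair in a temp directory and
# print the public key's fingerprint; a valid key shows its bit length,
# hash, and comment.
tmp=$(mktemp -d)
ssh-keygen -t rsa -N "" -q -f "$tmp/id_rsa"
ssh-keygen -lf "$tmp/id_rsa.pub"
```

Running the same ssh-keygen -lf command against ~/.ssh/id_rsa.pub verifies the pair you just generated.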
Preserve the public and private keys
- Copy or upload the generated public and private keys to a secure location, such as workspace files, cloud storage, or a volume.
- Create a cluster init script and attach it to the cluster to automate restoring the SSH keys at startup. The init script copies the SSH keys from your secure location to the appropriate location on the cluster.
#!/bin/bash
sleep 5
# Nodes don't have a .ssh directory by default
mkdir -p /root/.ssh/
chmod 700 /root/.ssh/
# Copy the private key from the secure location to .ssh
cp <source-path-for-private-key> /root/.ssh/id_rsa
# Restrict the permissions of the private key file
chmod 400 /root/.ssh/id_rsa
For more information on creating an init script, refer to the What are init scripts? (AWS | Azure | GCP) documentation.
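The preserve-and-restore cycle the init script automates can be sketched end to end. The sketch below uses temporary directories in place of the real secure location and /root/.ssh, so the paths are illustrative only:

```shell
# SECURE_DIR stands in for your volume/workspace path; RESTORE_DIR stands
# in for /root/.ssh on the cluster node. Both are illustrative.
SECURE_DIR=$(mktemp -d)
RESTORE_DIR=$(mktemp -d)/.ssh

# Preserve: generate a key pair and keep it in the secure location.
ssh-keygen -t rsa -N "" -q -f "$SECURE_DIR/id_rsa"

# Restore: what the init script does at every cluster startup.
mkdir -p "$RESTORE_DIR"
cp "$SECURE_DIR/id_rsa" "$RESTORE_DIR/id_rsa"
chmod 400 "$RESTORE_DIR/id_rsa"

ls -l "$RESTORE_DIR/id_rsa"
```

On a real cluster, SECURE_DIR is your chosen storage path and RESTORE_DIR is /root/.ssh; the restore half is exactly what the init script runs.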
Copy the public key to a remote server
Copy the public key (id_rsa.pub) to the ~/.ssh/authorized_keys file on the remote server. Ensure that the permissions of the ~/.ssh folder and the authorized_keys file on the remote server are set correctly (typically 700 and 600, respectively) to avoid access issues.
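The layout expected on the remote server can be sketched locally. REMOTE_HOME below is a stand-in for the remote user's home directory, and the key pair is a throwaway generated only for illustration:

```shell
# REMOTE_HOME stands in for the remote user's home directory.
REMOTE_HOME=$(mktemp -d)
mkdir -p "$REMOTE_HOME/.ssh"

# Generate a throwaway public key for illustration; in practice you append
# the contents of the id_rsa.pub generated in the Databricks notebook.
ssh-keygen -t rsa -N "" -q -f "$REMOTE_HOME/example_key"
cat "$REMOTE_HOME/example_key.pub" >> "$REMOTE_HOME/.ssh/authorized_keys"

# Permissions sshd typically requires: 700 on ~/.ssh, 600 on authorized_keys.
chmod 700 "$REMOTE_HOME/.ssh"
chmod 600 "$REMOTE_HOME/.ssh/authorized_keys"
```

If the folder or file permissions are looser than this, many sshd configurations silently ignore authorized_keys and key-based login fails.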
Test the connection from a Databricks notebook
%sh
ssh user@remote_host
Or
%sh
sftp user@remote_host