Upload large files using DBFS API 2.0 and PowerShell

Use PowerShell and the DBFS API to upload large files to your Databricks workspace.

Written by ravirahul.padmanabhan

Last published at: September 27th, 2022

Using the Databricks REST API to interact with your clusters programmatically can be a great way to streamline workflows with scripts.

The API can be called with various tools, including PowerShell. This article walks through an example DBFS put command using curl and then shows how to execute the same command using PowerShell.

The DBFS API 2.0 put command (AWS | Azure) limits the data passed in the contents parameter to 1 MB when the data is sent as a string. The same command accepts up to 2 GB when the data is sent as a file. The put command is mainly used for streaming uploads, but it also works as a convenient single call for data upload.
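The two limits suggest a simple size check before choosing how to call put. The following is a minimal sketch, not part of the API: the helper name dbfs_put_mode is hypothetical, and it assumes 1 MB means 1024 * 1024 bytes and 2 GB means 2 * 1024^3 bytes, measured on the raw file. Check the API documentation for the exact thresholds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a put strategy from the file size in bytes.
# Assumption: 1 MB = 1024*1024 bytes, 2 GB = 2*1024^3 bytes (raw file size).
INLINE_LIMIT=$((1024 * 1024))
FILE_LIMIT=$((2 * 1024 * 1024 * 1024))

dbfs_put_mode() {
  size_bytes="$1"
  if [ "$size_bytes" -le "$INLINE_LIMIT" ]; then
    echo "string"      # small enough to pass in the contents parameter
  elif [ "$size_bytes" -le "$FILE_LIMIT" ]; then
    echo "file"        # send as a multipart file upload
  else
    echo "too-large"   # exceeds what a single put call accepts
  fi
}
```

For example, `dbfs_put_mode "$(wc -c < "$local_file_path" | tr -d ' ')"` reports which form of the call fits a given local file.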

Curl example

This example uses curl to send a simple multipart form post request to the API to upload a file up to 2 GB in size.

Replace all of the values in <> with appropriate values for your environment.

Info

To get your workspace URL, review Workspace instance names, URLs, and IDs (AWS | Azure).

Review the Generate a personal access token (AWS | Azure) documentation for details on how to create a personal access token for use with the REST APIs.

# Parameters
databricks_workspace_url="<databricks-workspace-url>"
personal_access_token="<personal-access-token>"
local_file_path="<local_file_path>"              # ex: /Users/foo/Desktop/file_to_upload.png
dbfs_file_path="<dbfs_file_path>"                # ex: /tmp/file_to_upload.png
overwrite_file="<true|false>"


# Quote each argument so paths containing spaces are passed intact
curl --location --request POST "https://${databricks_workspace_url}/api/2.0/dbfs/put" \
     --header "Authorization: Bearer ${personal_access_token}" \
     --form "contents=@${local_file_path}" \
     --form "path=${dbfs_file_path}" \
     --form "overwrite=${overwrite_file}"
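After the upload, one way to confirm that the file arrived is to ask the DBFS API for its metadata through the get-status endpoint. The sketch below reuses the variable names from the curl example. The function name dbfs_get_status and the CURL_BIN override are hypothetical and exist only for illustration; CURL_BIN lets you dry-run the command, for example with CURL_BIN=echo.

```shell
#!/usr/bin/env bash
# Sketch: query DBFS for the metadata of an uploaded file.
# Assumes databricks_workspace_url and personal_access_token are already set.
# CURL_BIN is a hypothetical override used only to dry-run the command.

dbfs_get_status() {
  dbfs_path="$1"
  # --get turns the url-encoded data into a query string on a GET request
  "${CURL_BIN:-curl}" --silent --show-error --get \
    "https://${databricks_workspace_url}/api/2.0/dbfs/get-status" \
    --header "Authorization: Bearer ${personal_access_token}" \
    --data-urlencode "path=${dbfs_path}"
}
```

Calling `dbfs_get_status "${dbfs_file_path}"` should return a JSON document describing the file if it exists, which you can compare against the size of the local file.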

PowerShell example

This PowerShell example is longer than the curl example, but it sends the same multipart form post request to the API.

The script below can be used in any environment where PowerShell is supported.

To run the PowerShell script you must:

  1. Replace all of the values in <> with appropriate values for your environment. Review the DBFS API 2.0 put documentation for more information.
  2. Save the script as a .ps1 file. For example, you could call it upload_large_file_to_dbfs.ps1.
  3. Execute the script in PowerShell by running ./upload_large_file_to_dbfs.ps1 at the prompt.

##################################################
# Parameters
$DBX_HOST = "<databricks-workspace-url>"
$DBX_TOKEN = "<personal-access-token>"
$FILE_TO_UPLOAD = "<local_file_path>"      # ex: /Users/foo/Desktop/file_to_upload.png  
$DBFS_PATH = "<dbfs_file_path>"            # ex: /tmp/file_to_upload.png
$OVERWRITE_FILE = "<true|false>"
##################################################


# Configure authentication
$headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$headers.Add("Authorization", "Bearer "  + $DBX_TOKEN)

$multipartContent = [System.Net.Http.MultipartFormDataContent]::new()

# Local file path (opened read-only)
$FileStream = [System.IO.FileStream]::new($FILE_TO_UPLOAD, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read)
$fileHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new("form-data")
# Use the same form field name as the curl example (contents)
$fileHeader.Name = "contents"
$fileHeader.FileName = $(Split-Path $FILE_TO_UPLOAD -leaf)
$fileContent = [System.Net.Http.StreamContent]::new($FileStream)
$fileContent.Headers.ContentDisposition = $fileHeader
$fileContent.Headers.ContentType = [System.Net.Http.Headers.MediaTypeHeaderValue]::Parse("text/plain")
$multipartContent.Add($fileContent)


# DBFS path
$stringHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new("form-data")
$stringHeader.Name = "path"
$stringContent = [System.Net.Http.StringContent]::new($DBFS_PATH)
$stringContent.Headers.ContentDisposition = $stringHeader
$multipartContent.Add($stringContent)


# File overwrite config
$stringHeader = [System.Net.Http.Headers.ContentDispositionHeaderValue]::new("form-data")
$stringHeader.Name = "overwrite"
$stringContent = [System.Net.Http.StringContent]::new($OVERWRITE_FILE)
$stringContent.Headers.ContentDisposition = $stringHeader
$multipartContent.Add($stringContent)


# Call Databricks DBFS REST API
$uri = 'https://' + $DBX_HOST + '/api/2.0/dbfs/put'
try {
    $response = Invoke-RestMethod $uri -Method 'POST' -Headers $headers -Body $multipartContent
    $response | ConvertTo-Json
}
finally {
    # Release the local file handle even if the request fails
    $FileStream.Dispose()
}
Info

You can run PowerShell scripts on Linux and macOS as well as Windows. The command to run a PowerShell script is slightly different in those environments. Refer to the PowerShell documentation if you are running the script on a platform other than Windows.
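As an illustration, on Linux or macOS the script can typically be started from a regular shell through the pwsh executable that ships with cross-platform PowerShell. The wrapper below is a minimal sketch; the function name run_ps1 is hypothetical, and it assumes pwsh is installed and on your PATH.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a .ps1 script with cross-platform PowerShell (pwsh).
run_ps1() {
  script="$1"
  if ! command -v pwsh >/dev/null 2>&1; then
    echo "pwsh not found on PATH; install PowerShell first" >&2
    return 127
  fi
  # -File runs the given script file and exits with its exit code
  pwsh -File "$script"
}
```

For example, `run_ps1 ./upload_large_file_to_dbfs.ps1` runs the script saved in the steps above.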