Slowness when using the Foundation Model APIs in pay-per-token mode

Switch to provisioned throughput mode for workloads that require high throughput and performance guarantees.

Written by kaushal.vachhani

Last published at: January 7th, 2025

Problem

When you access Databricks Foundation Model APIs in pay-per-token mode, you may notice extended response times and reduced inference efficiency, particularly in first-token generation (a long time to first token). 
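To confirm the symptom, you can measure the time to first token (TTFT) on a streaming response. The sketch below is a generic helper that works with any chunk iterator; the commented OpenAI-compatible client usage is an assumption to adapt to your workspace host, token, and endpoint name:

```python
import time
from typing import Iterable, Tuple, List, Optional


def time_to_first_token(stream: Iterable) -> Tuple[Optional[float], List]:
    """Return (seconds until the first chunk arrived, all chunks) for a stream."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            # Record latency when the very first chunk arrives.
            ttft = time.monotonic() - start
        chunks.append(chunk)
    return ttft, chunks


# Hypothetical usage against a Databricks serving endpoint via an
# OpenAI-compatible client (names and endpoint are placeholders):
#
# from openai import OpenAI
# client = OpenAI(base_url=f"{workspace_host}/serving-endpoints", api_key=token)
# stream = client.chat.completions.create(
#     model="your-endpoint-name",
#     messages=[{"role": "user", "content": "Hello"}],
#     stream=True,
# )
# ttft, chunks = time_to_first_token(stream)
```

Comparing TTFT at different times of day can show whether the delay tracks shared pay-per-token load rather than your own request size.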


Cause

The Foundation Model APIs pay-per-token mode is a multi-tenant service. When multiple customers send requests with long contexts, those requests can consume most of the available GPU resources. As a result, first-token latency can increase significantly. 


This mode is not designed for high-throughput applications or performant production workloads.


Solution

For production workloads requiring:  

  • High throughput
  • Performance guarantees
  • Fine-tuned models
  • Enhanced security requirements

Databricks recommends choosing the provisioned throughput mode instead. It is specifically designed to meet these production-grade requirements.
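As a sketch of what the switch involves, the helper below builds a serving-endpoint request payload that asks for provisioned throughput. The field names follow the Databricks serving endpoints REST API as an assumption; the model name, version, and throughput values are placeholders to verify against the documentation for your cloud:

```python
def provisioned_throughput_config(endpoint_name: str,
                                  entity_name: str,
                                  entity_version: str,
                                  max_throughput: int) -> dict:
    """Build a serving-endpoint payload requesting provisioned throughput.

    Field names are assumed from the Databricks serving endpoints REST API;
    confirm them in your cloud's documentation before sending the request.
    """
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": entity_name,      # e.g. a registered model
                    "entity_version": entity_version,
                    # Throughput band (tokens/sec) the endpoint scales within:
                    "min_provisioned_throughput": 0,
                    "max_provisioned_throughput": max_throughput,
                }
            ]
        },
    }
```

Such a payload would typically be sent to the serving endpoints REST API (or built through the Databricks SDK) using a workspace token.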

For more information, refer to the Databricks Foundation Model APIs documentation (AWS | Azure | GCP).