Problem
When you access Databricks Foundation Model APIs in pay-per-token mode, you notice long response times and slow model inference, particularly a high time to first token.
Cause
The Foundation Model APIs pay-per-token mode is a multi-tenant service. When multiple customers send requests with long contexts, those requests consume most of the available GPU resources. As a result, the time to first token can increase significantly.
This mode is not designed for high-throughput applications or performant production workloads.
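To quantify this symptom, you can measure time to first token (TTFT) on a streaming response. The sketch below is illustrative only: `simulated_stream` is a hypothetical stand-in for a streaming endpoint response, with the first token artificially delayed to mimic queuing behind long-context requests on shared GPUs.

```python
import time

def time_to_first_token(stream):
    """Return (seconds until first token, first token) for a streaming response."""
    start = time.monotonic()
    for token in stream:
        return time.monotonic() - start, token
    raise ValueError("stream produced no tokens")

# Hypothetical stand-in for a streaming endpoint response. The delay before the
# first yield simulates first-token queuing in a busy multi-tenant service.
def simulated_stream(first_token_delay=0.2):
    time.sleep(first_token_delay)
    yield "Hello"
    for token in [",", " world"]:
        time.sleep(0.01)
        yield token

ttft, first = time_to_first_token(simulated_stream(first_token_delay=0.2))
print(f"time to first token: {ttft:.2f}s (first token: {first!r})")
```

Running the same measurement against a real streaming response lets you compare first-token latency across modes and times of day.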
Solution
For production workloads requiring:
- High throughput
- Performance guarantees
- Fine-tuned models
- Enhanced security requirements
Databricks recommends using provisioned throughput mode instead. It is specifically designed to meet these production-grade requirements.
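As an illustration, a provisioned throughput endpoint is created through the Databricks serving-endpoints REST API by posting a configuration along these lines. The endpoint name, model entity name, and throughput values below are placeholder assumptions; valid throughput values depend on the bands supported by the chosen model.

```python
# Sketch of a provisioned throughput endpoint configuration (placeholder values).
endpoint_config = {
    "name": "my-provisioned-endpoint",  # hypothetical endpoint name
    "config": {
        "served_entities": [
            {
                # Placeholder model; substitute the model you intend to serve.
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
                "entity_version": "1",
                # Tokens-per-second range; valid values are model-dependent.
                "min_provisioned_throughput": 0,
                "max_provisioned_throughput": 9500,
            }
        ]
    },
}
# Sent as: POST {workspace-url}/api/2.0/serving-endpoints with this JSON body.
```

Setting `min_provisioned_throughput` to a low value lets the endpoint scale down when idle, while `max_provisioned_throughput` caps cost under peak load.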
For more information, please refer to the Databricks Foundation Model APIs (AWS | Azure | GCP) documentation.