Problem
When you access Databricks Foundation Model APIs in pay-per-token mode, you notice long response times and slow model inference, particularly a high time to first token.
Cause
The Foundation Model APIs pay-per-token mode is a multi-tenant service. When multiple customers send requests with long contexts, those requests consume most of the available GPU resources. As a result, the time to first token can increase significantly.
This mode is not designed for high-throughput applications or performant production workloads.
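To quantify this symptom, you can measure time to first token (TTFT) on a streaming response. The sketch below is illustrative only: `simulated_stream` is a hypothetical stand-in for a streaming endpoint response, with the first token artificially delayed to mimic queuing behind long-context requests on shared GPUs.

```python
import time

def time_to_first_token(stream):
    """Return (seconds until first token, first token) for a streaming response."""
    start = time.monotonic()
    for token in stream:
        return time.monotonic() - start, token
    raise ValueError("stream produced no tokens")

# Hypothetical stand-in for a streaming endpoint response. The delay before the
# first yield simulates first-token queuing in a busy multi-tenant service.
def simulated_stream(first_token_delay=0.2):
    time.sleep(first_token_delay)
    yield "Hello"
    for token in [",", " world"]:
        time.sleep(0.01)
        yield token

ttft, first = time_to_first_token(simulated_stream(first_token_delay=0.2))
print(f"time to first token: {ttft:.2f}s (first token: {first!r})")
```

Running the same measurement against a real streaming response lets you compare first-token latency across modes and times of day.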
Solution
For production workloads requiring:
- High throughput
- Performance guarantees
- Fine-tuned models
- Enhanced security requirements
Databricks recommends using provisioned throughput mode instead. It is specifically designed to meet these production-grade requirements.
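As an illustration, a provisioned throughput endpoint is created through the Databricks serving-endpoints REST API by posting a configuration along these lines. The endpoint name, model entity name, and throughput values below are placeholder assumptions; valid throughput values depend on the bands supported by the chosen model.

```python
# Sketch of a provisioned throughput endpoint configuration (placeholder values).
endpoint_config = {
    "name": "my-provisioned-endpoint",  # hypothetical endpoint name
    "config": {
        "served_entities": [
            {
                # Placeholder model; substitute the model you intend to serve.
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",
                "entity_version": "1",
                # Tokens-per-second range; valid values are model-dependent.
                "min_provisioned_throughput": 0,
                "max_provisioned_throughput": 9500,
            }
        ]
    },
}
# Sent as: POST {workspace-url}/api/2.0/serving-endpoints with this JSON body.
```

Setting `min_provisioned_throughput` to a low value lets the endpoint scale down when idle, while `max_provisioned_throughput` caps cost under peak load.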
For more information, please refer to the Databricks Foundation Model APIs (AWS | Azure | GCP) documentation.