Problem
When training a model or fine-tuning a base model on a GPU compute cluster, you may encounter the following error (the GiB and MiB values vary) during these processes:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 492.00 MiB (GPU 0; 21.99 GiB total capacity; 20.84 GiB already allocated; 19.00 MiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This error is typically raised by workloads that use PyTorch directly or through libraries built on top of it, such as the Hugging Face Transformers library.
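The error message itself suggests one mitigation: setting `max_split_size_mb` via the `PYTORCH_CUDA_ALLOC_CONF` environment variable to reduce fragmentation in PyTorch's caching allocator. A minimal sketch, assuming the variable is set before the process imports torch (the 128 MiB value is illustrative and should be tuned for your workload):

```python
import os

# Must be set before torch is imported (or before launching the training
# process), so the CUDA caching allocator picks it up at initialization.
# 128 MiB is an illustrative starting value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

On a Databricks cluster, the same variable can instead be set cluster-wide under the cluster's Spark environment variables, so every notebook process inherits it.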
Cause
The GPU has run out of CUDA memory: PyTorch tried to allocate an additional block for the model, but not enough free memory was available on the device. As the error message hints, fragmentation of the memory PyTorch has already reserved can also cause the failure even when the reserved total looks close to sufficient.
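Plugging the figures from the example error message above into a quick calculation shows why the allocation fails: even counting the memory PyTorch has reserved but not allocated to live tensors, there is less than the 492 MiB being requested.

```python
# Figures taken from the example error message (values vary per run).
total_gib     = 21.99   # GPU 0 total capacity
allocated_gib = 20.84   # already allocated to live tensors
reserved_gib  = 21.24   # reserved in total by PyTorch's caching allocator
free_mib      = 19.00   # free on the device
request_mib   = 492.00  # size of the failed allocation

# Memory reserved by PyTorch but not backing live tensors. It exists as
# cached blocks, which may also be too fragmented to serve one large request.
cached_mib = (reserved_gib - allocated_gib) * 1024
print(f"cached but unallocated: {cached_mib:.0f} MiB")

# The request exceeds free device memory plus the cached slack combined,
# so the allocator raises torch.cuda.OutOfMemoryError.
print(request_mib > free_mib + cached_mib)
```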
Info
GPU memory is separate from the memory used by the worker and driver nodes of the cluster. GPU memory is specific to the GPU device being used for computations.
You can check GPU utilization by navigating to the Metrics tab of the cluster you are using to run your notebook. From there, you can filter the results by selecting GPU from the drop-down in the top-right corner of the page.
Solution
Select a suitable GPU device for your intended task, whether it's model training, fine-tuning, or inference. After determining which GPU device is best suited for your workload, navigate to Compute, select an existing cluster or create a new one, then select a Driver/Worker node type that utilizes the chosen GPU device. Once you've made this selection, you can resume working with your model.
Info
Each cloud provider decides which instance types are available in each region. Review your cloud provider's documentation (AWS, Azure, GCP) to determine whether a specific GPU is available in the region you are using.
Model training
Research the GPU devices available in compute instances from your cloud provider. For example, to address the problem in the error message above, if your current cluster instance contains T4 GPU devices, consider switching to A10 or V100 devices, which offer more GPU memory. Then rerun your process.
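When sizing a GPU for training, a rough lower bound is often estimated from the parameter count: weights plus gradients plus optimizer states (Adam keeps two moment buffers per parameter), with activations and fragmentation adding more on top. A minimal sketch of that rule of thumb; the 1.5B-parameter figure below is a hypothetical example, and the 16 GiB (T4) capacity is the published spec for that device:

```python
def estimate_training_mem_gib(n_params: float,
                              bytes_per_param: int = 4,
                              optimizer_states: int = 2) -> float:
    """Rough lower bound on GPU memory needed for training.

    Counts weights + gradients + optimizer states (Adam keeps two
    buffers per parameter). Activations, CUDA context, and allocator
    fragmentation add more on top, so treat this as a floor.
    """
    copies = 1 + 1 + optimizer_states  # weights + grads + optimizer states
    return n_params * bytes_per_param * copies / 1024**3

# Hypothetical example: a 1.5B-parameter model trained in fp32 with Adam
# needs roughly 22 GiB before activations -- already beyond a 16 GiB T4.
print(f"{estimate_training_mem_gib(1.5e9):.1f} GiB")
```

If the estimate exceeds a single device's capacity even on the larger GPU, techniques such as mixed precision or gradient checkpointing may be needed in addition to changing instance types.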
Fine-tuning or inference
Check the model's repository on GitHub or its page on Hugging Face to see if specific GPU devices are recommended for specific tasks with that model. For example, Databricks' Dolly LLM GitHub repository specifies particular GPU instances to get started with response generation and training.