Problem
While performing model training on a GPU-enabled compute, you receive an error message similar to the following example.
warnings.warn(
OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU
File <command-xxxxxx>, line 88
     85     return outputs[0]["generated_text"]
     87 # Instantiate the Hugging Face–based LLMs
---> 88 response_llm = MistralLLM()
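For context, this error typically originates from a wrapper class that loads the model at the default 32-bit precision and then moves it onto the GPU. The following is a hypothetical reconstruction based on the traceback above; the class name MistralLLM comes from the traceback, but the model ID, prompt handling, and generation settings are assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

class MistralLLM:
    def __init__(self):
        # Loads the full model in the default 32-bit precision
        self.model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
        # Moving the 32-bit weights onto the GPU here is what raises the OutOfMemoryError
        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            device=0 if torch.cuda.is_available() else -1,
        )

    def __call__(self, prompt):
        outputs = self.pipe(prompt, max_new_tokens=256)
        return outputs[0]["generated_text"]

# Instantiate the Hugging Face–based LLM; this is the line that fails
response_llm = MistralLLM()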
Cause
GPU memory utilization is high.
By default, the model loads in 32-bit precision. A model with a large number of parameters (for example, more than seven billion) loaded at this precision requires more GPU memory than the default compute provides.
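As a rough illustration (the seven-billion parameter count is only an example), the model weights alone at 32-bit precision occupy about 26 GiB, which exceeds the memory of many single GPUs before activations and CUDA overhead are counted. Halving the precision to 16 bit halves that footprint:

# Approximate weight-memory footprint for a hypothetical 7B-parameter model
params = 7_000_000_000

fp32_gib = params * 4 / 1024**3  # 4 bytes per parameter at 32-bit precision
fp16_gib = params * 2 / 1024**3  # 2 bytes per parameter at 16-bit precision

print(f"FP32 weights: ~{fp32_gib:.1f} GiB")  # ~26.1 GiB
print(f"FP16 weights: ~{fp16_gib:.1f} GiB")  # ~13.0 GiB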
Solution
Update to a higher-capacity GPU node.
Alternatively, if you do not want to use a higher-capacity GPU node, modify your model’s from_pretrained() call to reduce the precision from 32-bit to 16-bit and to distribute the model layers automatically across the available devices (GPU or CPU).

Make sure to comment out the device=0 if torch.cuda.is_available() else -1 line of code, which is a manual check for GPU availability and conflicts with device_map="auto".
from transformers import AutoModelForCausalLM
import torch

# Comment out this line, which manually selects the GPU when one is available:
# device=0 if torch.cuda.is_available() else -1

model = AutoModelForCausalLM.from_pretrained(
    "<your-model-name>",
    torch_dtype=torch.float16,  # Force FP16, reducing precision from 32-bit to 16-bit
    device_map="auto",          # Automatically place layers on GPU/CPU
)
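With the model loaded this way, you can build the rest of your generation code as before; pipeline infers device placement from a model loaded with device_map="auto", so you do not pass a device argument. A minimal usage sketch, reusing the placeholder model name above and an illustrative prompt:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("<your-model-name>")

# No device argument: the pipeline respects the model's existing device map
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

outputs = pipe("Summarize why CUDA runs out of memory.", max_new_tokens=64)
print(outputs[0]["generated_text"])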