Experimental benchmarks on the GPU requirements for the training/fine-tuning of LLMs.
This post gathers experimental setups documenting GPU usage when training/fine-tuning LLMs in the wild.
Table of Contents

- Fine-tuning with DeepSpeed
- Fine-tuning with FSDP
- Fine-tuning with LoRA
Fine-tuning with DeepSpeed
- My own experience (see the config sketch below):

  | Model | Context length | DeepSpeed ZeRO offload | GPU | Precision | Batch size per GPU | Gradient accumulation |
  |---|---|---|---|---|---|---|
  | CodeT5+ (2B) | 1024 | Yes (50 GB CPU RAM) | 2x A100 40 GB | bf16 | 1 | 4 |
  | FLAN-T5-XL (3B) | 1024 | Yes (50 GB CPU RAM) | 2x RTX 4090 24 GB | bf16 | 1 | 4 |
- FLAN-T5-XL (3B) + FLAN-T5-XXL (11B)
(Source: https://www.philschmid.de/fine-tune-flan-t5-deepspeed)
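
The setups above can be reproduced through the Hugging Face Trainer's DeepSpeed integration. The snippet below is a minimal sketch, not the exact script behind the table: the model id, dataset, and ZeRO stage are assumptions, while the batch size and gradient accumulation mirror the rows above.

```python
# Minimal sketch: HF Trainer + DeepSpeed ZeRO-3 with CPU offload in bf16.
# The model id and dataset are placeholders; batch size 1 and gradient
# accumulation 4 mirror the table above.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # optimizer state -> CPU RAM
        "offload_param": {"device": "cpu", "pin_memory": True},      # parameters -> CPU RAM
    },
    # "auto" lets the HF integration fill these in from TrainingArguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

model_id = "google/flan-t5-xl"  # placeholder checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

train_dataset = ...  # placeholder: your tokenized dataset

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    deepspeed=ds_config,  # a dict or a path to a JSON config both work
)
Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer).train()
# Launch on 2 GPUs with: deepspeed --num_gpus=2 train.py
```

With ZeRO-3 offload, optimizer states and parameters are paged out to host memory, which is presumably where the roughly 50 GB of CPU RAM in the table goes.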
Fine-tuning with FSDP
- My own experience (see the launch sketch below):

  | Model | Context length | Full shard | GPU | Precision | Batch size per GPU | Gradient accumulation |
  |---|---|---|---|---|---|---|
  | Llama-Alpaca (6B) | 512 | Yes | 6x A100 40 GB | bf16 | 1 | 4 |

- GPT-2 XL (1.5B) on 2x 24 GB NVIDIA RTX GPUs
(Source: https://huggingface.co/blog/pytorch-fsdp)
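
A run like the Llama-Alpaca row above can be launched through the Hugging Face Trainer's FSDP integration. The sketch below is an assumed setup rather than the exact script: the checkpoint path and dataset are placeholders, and depending on your transformers version you may additionally need to point FSDP at the decoder-layer class via `fsdp_config`.

```python
# Minimal sketch: HF Trainer with PyTorch FSDP full sharding in bf16.
# The checkpoint and dataset are placeholders; batch size 1 and gradient
# accumulation 4 mirror the table above.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "path/to/llama-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

train_dataset = ...  # placeholder: your tokenized dataset

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    fsdp="full_shard auto_wrap",  # shard parameters, gradients, and optimizer state across GPUs
)
Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer).train()
# Launch across 6 GPUs with: torchrun --nproc_per_node 6 train.py
```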
Fine-tuning with LoRA
- My own experience (see the PEFT sketch below):

  | Model | Context length | GPU | Precision | Batch size per GPU | Gradient accumulation |
  |---|---|---|---|---|---|
  | CodeT5+ (2B) | 1024 | 1x RTX 4090 24 GB | fp16 | 1 | 8 |
- T0 (3B) + mT0 (12B) + BLOOMZ (7B)
(Source: https://github.com/huggingface/peft)
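
The PEFT repository linked above shows the general recipe; the snippet below is a minimal sketch along those lines for a single 24 GB GPU in fp16. The model id, LoRA rank, and other hyperparameters are illustrative assumptions, not the exact values used for the CodeT5+ run in the table.

```python
# Minimal sketch: LoRA fine-tuning with PEFT in fp16 on one 24 GB GPU.
# Model id and LoRA hyperparameters are illustrative; batch size 1 and
# gradient accumulation 8 mirror the table above.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

model_id = "bigscience/mt0-large"  # placeholder seq2seq checkpoint (as in the PEFT README)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,            # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

train_dataset = ...  # placeholder: your tokenized dataset

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,
)
Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer).train()
```

Because only the adapter weights receive gradients, the optimizer state stays small, which is what lets a multi-billion-parameter model fit on a single 24 GB card.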