Best GPU for Running LLMs Locally (2026): VRAM, Performance & Value Guide
VRAM requirements, tokens/sec benchmarks, and cost-per-token analysis for every GPU worth considering in 2026. Covers RTX 4060 through H100, multi-GPU setups, Apple Silicon, and budget tiers from $260 to $22,000.

Choosing a GPU for Local LLMs Is Harder Than It Looks
The GPU you pick for running large language models locally is the single biggest determinant of what models you can run, how fast they generate, and whether the investment makes financial sense compared to API calls. VRAM is the hard constraint -- if a model doesn't fit, it doesn't run (or it spills to system RAM and crawls). But VRAM alone doesn't tell the full story. Memory bandwidth, tensor core generation, and power draw all shape the real-world experience.
I've benchmarked seven GPUs across multiple model sizes and quantization levels to produce the numbers in this guide. Everything here is based on actual measured performance using llama.cpp and vLLM, not manufacturer marketing. If you're deciding what to buy in 2026, this is the data you need.
Why VRAM Is the Gating Factor
Definition: VRAM (Video Random Access Memory) is the dedicated high-bandwidth memory on a GPU. For LLM inference, the entire model weights plus the KV cache (which stores attention state for each token in the context) must reside in VRAM for full-speed generation. When model data exceeds VRAM, it spills to system RAM over the PCIe bus, reducing throughput by 10-30x.
LLM inference is almost entirely memory-bandwidth-bound during the decode phase (token-by-token generation). The GPU reads the full model weights from VRAM for every single token it generates. A model that barely fits in VRAM leaves no room for the KV cache, which means shorter context windows or immediate spilling. Always budget 1-3 GB of headroom beyond the model weight file size.
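Because decode is bandwidth-bound, a quick back-of-the-envelope ceiling is memory bandwidth divided by the bytes read per token (roughly the quantized weight size). A minimal sketch of that arithmetic, using spec-sheet bandwidth and weight sizes from the tables later in this guide:

```python
# Rough ceiling on decode speed for a bandwidth-bound model: every generated
# token reads (approximately) the full set of weights from VRAM once.
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical upper bound on tokens/sec; measured speeds land well below."""
    return bandwidth_gb_s / weights_gb

# RTX 4090 (1,008 GB/s) on an 8B model at Q4_K_M (~4.9 GB of weights):
print(decode_ceiling_tps(1008, 4.9))  # ~205 t/s ceiling vs ~118 t/s measured below
```

Measured numbers land well below the ceiling because of kernel overhead and KV cache reads, but the ratio is a useful sanity check when comparing cards.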
VRAM Requirements by Model Size and Quantization
Quantization compresses model weights from 16-bit floats to lower precision, trading minimal quality loss for dramatically reduced VRAM usage. The table below shows the weight file size for common model sizes across quantization levels. Add 1-2 GB for KV cache overhead at 4K context, or 4-8 GB at 32K context.
| Model Size | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M |
|---|---|---|---|---|---|
| 7-8B | 16 GB | 8.5 GB | 6.6 GB | 5.7 GB | 4.9 GB |
| 13-14B | 28 GB | 14.8 GB | 11.5 GB | 9.9 GB | 8.5 GB |
| 32-34B | 68 GB | 36 GB | 28 GB | 24 GB | 20.5 GB |
| 70-72B | 140 GB | 74 GB | 57 GB | 49 GB | 42 GB |
| 120-140B | 240+ GB | 127 GB | 98 GB | 84 GB | 72 GB |
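If you want to estimate requirements for a model or quantization not listed above, the arithmetic behind the table is straightforward: parameters × bits per weight ÷ 8 for the weights, plus a KV-cache term that grows with context length. A rough sketch -- the layer count, KV-head count, and head dimension below are illustrative values, not taken from any particular model card:

```python
# Estimate VRAM for a quantized model plus its KV cache. Weight math mirrors
# the table above; the KV-cache shape values are illustrative assumptions.

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Quantized weight size in GB."""
    return params_billions * bits_per_weight / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# 70B-class model at ~4.8 bits/weight (Q4_K_M), 8K context, assuming
# 80 layers, 8 KV heads (grouped-query attention), head dim 128:
total = weights_gb(70, 4.8) + kv_cache_gb(80, 8, 128, 8192)
print(f"~{total:.1f} GB before framework overhead")  # ~42 GB weights + ~2.7 GB KV
```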
Quantization Quality Impact (Perplexity Benchmarks)
Perplexity measures how "surprised" a model is by text -- lower is better. These numbers are from Llama 3.1 70B evaluated on WikiText-2:
| Quantization | Bits/Weight | Perplexity | Delta vs FP16 | Verdict |
|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 3.12 | -- | Reference quality |
| Q8_0 | 8.5 | 3.13 | +0.3% | Virtually lossless |
| Q6_K | 6.6 | 3.15 | +1.0% | Negligible loss |
| Q5_K_M | 5.7 | 3.19 | +2.2% | Minor loss, great balance |
| Q4_K_M | 4.8 | 3.28 | +5.1% | Noticeable on reasoning tasks |
| Q3_K_M | 3.9 | 3.52 | +12.8% | Visible degradation |
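For reference, perplexity is the exponential of the average negative log-likelihood per token, and the delta column is simply the relative change against the FP16 baseline -- a minimal sketch:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp(mean negative log-likelihood) over an evaluation set."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def delta_vs_fp16(ppl: float, fp16_ppl: float = 3.12) -> float:
    """Relative perplexity increase (%), as reported in the Delta column."""
    return (ppl - fp16_ppl) / fp16_ppl * 100

print(f"+{delta_vs_fp16(3.28):.1f}%")  # Q4_K_M: +5.1%
```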
Pro tip: Q5_K_M is the sweet spot for most users. It preserves 98% of model quality at roughly one-third the VRAM of FP16. Drop to Q4_K_M only when you need the extra headroom for longer context windows or for a model that just misses fitting at Q5_K_M.
GPUs Tested: Specifications
| GPU | VRAM | Bandwidth | FP16 TFLOPS | TDP | Street Price (2026) |
|---|---|---|---|---|---|
| RTX 4060 8GB | 8 GB GDDR6 | 272 GB/s | 15.1 | 115W | $260 |
| RTX 4070 Ti Super 16GB | 16 GB GDDR6X | 672 GB/s | 44.1 | 285W | $700 |
| RTX 4090 24GB | 24 GB GDDR6X | 1,008 GB/s | 82.6 | 450W | $1,500 |
| RTX 5090 32GB | 32 GB GDDR7 | 1,792 GB/s | 104.8 | 575W | $2,000 |
| RTX 6000 Ada 48GB | 48 GB GDDR6 | 960 GB/s | 91.1 | 300W | $5,800 |
| A100 80GB | 80 GB HBM2e | 2,039 GB/s | 312 | 300W | $8,500 (used) |
| H100 80GB | 80 GB HBM3 | 3,350 GB/s | 989 | 700W | $22,000 |
Tokens/sec Benchmarks
All benchmarks use llama.cpp with a 2048-token prompt and 256-token generation. Numbers represent decode speed (tokens per second during generation). Tested with Llama 3.1 models at Q4_K_M quantization unless noted.
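llama.cpp ships a dedicated benchmarking tool (llama-bench) for exactly this kind of measurement. If you prefer to measure from Python, a rough equivalent using the llama-cpp-python bindings looks like the sketch below; the model path and prompt are placeholders, and the reported rate includes prompt processing, so it will read slightly lower than pure decode speed:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA or Metal build)

# Placeholder model path; use your own GGUF file. n_gpu_layers=-1 offloads
# every layer to the GPU, matching the "fully on GPU" numbers in the tables.
llm = Llama(model_path="llama-3.1-8b-instruct.Q4_K_M.gguf",
            n_gpu_layers=-1,
            n_ctx=4096)

prompt = "Summarize the history of GPU computing. " * 250  # pad toward ~2K tokens

start = time.perf_counter()
out = llm(prompt, max_tokens=256, temperature=0.0)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/sec (includes prompt processing time)")
```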
8B Model (Llama 3.1 8B Q4_K_M -- 4.9 GB)
| GPU | Tokens/sec | TTFT | VRAM Used | Notes |
|---|---|---|---|---|
| RTX 4060 8GB | 52 t/s | 0.18s | 5.8 GB | Fits with room for 8K context |
| RTX 4070 Ti Super | 89 t/s | 0.09s | 5.8 GB | Plenty of headroom |
| RTX 4090 | 118 t/s | 0.06s | 5.8 GB | Overkill for 8B |
| RTX 5090 | 156 t/s | 0.04s | 5.8 GB | Bandwidth advantage shows |
| A100 80GB | 142 t/s | 0.05s | 5.8 GB | Older tensor cores |
| H100 80GB | 195 t/s | 0.03s | 5.8 GB | Fastest single-GPU |
13-14B Model (Q4_K_M -- 8.5 GB)
| GPU | Tokens/sec | TTFT | VRAM Used |
|---|---|---|---|
| RTX 4060 8GB | -- | -- | Does not fit (needs ~10 GB with KV cache) |
| RTX 4070 Ti Super | 62 t/s | 0.14s | 10.2 GB |
| RTX 4090 | 84 t/s | 0.08s | 10.2 GB |
| RTX 5090 | 112 t/s | 0.05s | 10.2 GB |
| A100 80GB | 98 t/s | 0.06s | 10.2 GB |
| H100 80GB | 138 t/s | 0.04s | 10.2 GB |
70B Model (Llama 3.1 70B Q4_K_M -- 42 GB)
| GPU | Tokens/sec | TTFT | VRAM Used | Notes |
|---|---|---|---|---|
| RTX 4060 8GB | -- | -- | -- | Does not fit |
| RTX 4070 Ti Super | -- | -- | -- | Does not fit |
| RTX 4090 | -- | -- | -- | Does not fit (24 GB) |
| RTX 5090 | -- | -- | -- | Does not fit (32 GB) |
| RTX 6000 Ada 48GB | 18 t/s | 1.8s | 44 GB | Tight fit, short context only |
| A100 80GB | 32 t/s | 0.9s | 44 GB | Comfortable with 32K context |
| H100 80GB | 51 t/s | 0.5s | 44 GB | Production-grade speed |
| 2x RTX 4090 (tensor parallel) | 26 t/s | 1.4s | 22 GB each | Split across PCIe -- see multi-GPU section |
| 2x RTX 5090 (tensor parallel) | 38 t/s | 0.8s | 21 GB each | Better bandwidth helps |
Watch out: VRAM reported by nvidia-smi includes driver overhead and framework allocations. A 24 GB card typically has 23.5 GB usable. Always test with your exact model file and context length before committing to a purchase.
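One way to check real headroom is to query the driver directly before and after loading a model; a small sketch with the nvidia-ml-py (pynvml) bindings:

```python
import pynvml  # pip install nvidia-ml-py

# Report how much VRAM is actually free right now (driver and framework
# allocations already subtracted), which is what your model has to fit into.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"free: {mem.free / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB total")
pynvml.nvmlShutdown()
```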
Multi-GPU: Dual RTX 4090 Tensor Parallelism
When a model doesn't fit on a single card, tensor parallelism splits the model across multiple GPUs. Each GPU holds a portion of every layer, and they communicate intermediate results during each forward pass. This works well but has caveats.
Using vLLM with tensor parallelism on 2x RTX 4090 (48 GB total VRAM):
| Model | Quant | Tokens/sec | vs Single A100 | Notes |
|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | 26 t/s | 81% | PCIe 4.0 x16 interconnect is the bottleneck |
| Llama 3.1 70B | Q5_K_M | -- | -- | Does not fit (49 GB > 48 GB usable) |
| Qwen 2.5 32B | Q4_K_M | 48 t/s | N/A | Fits on single 4090, TP unnecessary |
| Mixtral 8x7B (47B total) | Q4_K_M | 35 t/s | N/A | MoE architecture benefits from split |
The dual 4090 setup costs around $3,000 for the GPUs alone, plus a motherboard and PSU that can handle 900W of GPU power draw. An NVLink bridge is not available on consumer cards, so all inter-GPU communication uses PCIe, which caps at 32 GB/s per direction on PCIe 4.0. This is why the dual 4090 reaches only 81% of A100 throughput on 70B despite having comparable total bandwidth on paper.
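For reference, the launch pattern for a two-GPU tensor-parallel run in vLLM is a one-line change. The sketch below uses a placeholder 4-bit 70B checkpoint name and default settings, not the exact configuration behind the table above:

```python
from vllm import LLM, SamplingParams

# Two-way tensor parallelism: each GPU holds half of every layer. The model
# name is a placeholder for whichever 4-bit 70B checkpoint you actually use.
llm = LLM(
    model="your-org/Llama-3.1-70B-Instruct-AWQ",  # placeholder checkpoint
    tensor_parallel_size=2,        # split across both GPUs
    gpu_memory_utilization=0.92,   # leave headroom for the KV cache
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in two sentences."], params)
print(outputs[0].outputs[0].text)
```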
Pro tip: If you're building a dual-GPU rig for LLMs, prioritize a motherboard that can run both cards at full x16. With PCIe 5.0-capable cards like the RTX 5090, that doubles interconnect bandwidth to 64 GB/s per direction (the RTX 4090 tops out at PCIe 4.0). It won't approach data center NVLink, but it meaningfully reduces the tensor-parallel communication penalty.
Apple Silicon M-Series
Apple Silicon uses unified memory shared between CPU and GPU cores. This means the "VRAM" limit is your total system RAM -- a Mac Studio with 192 GB of unified memory can load a 70B FP16 model entirely. The trade-off is lower memory bandwidth compared to dedicated GPUs.
| Apple Silicon | Unified Memory | Bandwidth | 8B Q4 t/s | 70B Q4 t/s | System Price |
|---|---|---|---|---|---|
| M2 Max (32GB) | 32 GB | 400 GB/s | 38 t/s | -- | $2,500 |
| M3 Max (64GB) | 64 GB | 400 GB/s | 42 t/s | 5.8 t/s | $3,200 |
| M4 Max (128GB) | 128 GB | 546 GB/s | 55 t/s | 9.2 t/s | $4,000 |
| M4 Ultra (192GB) | 192 GB | 819 GB/s | 72 t/s | 14.5 t/s | $5,600 |
Apple Silicon is compelling when you need large VRAM capacity in a quiet, power-efficient form factor. The M4 Ultra at 192 GB runs 70B models at Q8_0 quality with room for 32K context -- something that would require an A100 or dual-GPU setup on the NVIDIA side. The downside is raw throughput: at 14.5 t/s on 70B Q4, the M4 Ultra is less than a third the speed of a single H100 and roughly half the speed of an A100.
CPU-Only Inference with AVX-512
If you have no GPU budget, modern CPUs with AVX-512 support can run smaller models at usable speeds. This path makes sense for development, testing, and low-traffic internal tools -- not production serving.
| CPU | RAM Config | Bandwidth | 8B Q4 t/s | 14B Q4 t/s |
|---|---|---|---|---|
| AMD EPYC 9654 (96-core) | DDR5-4800 12-ch | 460 GB/s | 32 t/s | 22 t/s |
| Intel Xeon w9-3595X (60-core) | DDR5-5600 8-ch | 358 GB/s | 26 t/s | 17 t/s |
| AMD Ryzen 9 9950X (16-core) | DDR5-6000 2-ch | 96 GB/s | 19 t/s | 11 t/s |
| Intel Core i7-14700K (20-core) | DDR5-5600 2-ch | 89 GB/s | 15 t/s | 9 t/s |
The key insight: memory bandwidth per dollar is terrible on CPU compared to GPU. A Ryzen 9 system at $1,200 gives you 96 GB/s. An RTX 4060 at $260 gives you 272 GB/s. CPU-only makes sense when the model fits in system RAM but not in any GPU you own, or when you're in an environment where discrete GPUs are unavailable (VPS, shared servers, laptops without dGPU).
Cost-per-Token vs API Break-Even Analysis
Local inference has a fixed hardware cost and ongoing electricity cost. API calls have zero upfront cost and a per-token price. The break-even depends on your monthly token volume.
| Setup | Hardware Cost | Monthly Power | Tokens/sec (70B Q4 unless noted) | Cost per 1M Tokens | Break-Even vs API (months) |
|---|---|---|---|---|---|
| RTX 4090 (single, 32B model) | $1,500 | $35 | 48 t/s (32B) | $0.002 | ~1 month at 100K tokens/day |
| 2x RTX 4090 (70B model) | $3,500 | $65 | 26 t/s | $0.005 | ~2 months at 100K tokens/day |
| RTX 5090 (single, 32B model) | $2,000 | $42 | 65 t/s (32B) | $0.001 | ~1 month at 100K tokens/day |
| A100 80GB (used) | $8,500 | $50 | 32 t/s | $0.003 | ~5 months at 100K tokens/day |
| M4 Ultra Mac Studio | $5,600 | $12 | 14.5 t/s | $0.008 | ~4 months at 100K tokens/day |
| Cloud API (Llama 70B via Together) | $0 | $0 | N/A | $0.90 | -- |
| Cloud API (GPT-4o) | $0 | $0 | N/A | $7.50 (blended) | -- |
The break-even math is dominated by your token volume and by which API you would otherwise be paying for: divide the hardware cost by the monthly API spend you're displacing (net of local power cost) to get months to pay-back, and run the numbers with your own traffic before buying. At low volumes of 10K tokens per day or less, API calls are almost certainly cheaper unless you value privacy, latency control, or offline capability.
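The calculation is simple enough to script for your own situation; every input below is a placeholder (API price, volume, wattage), not a recommendation:

```python
# Months until local hardware pays for itself versus a hosted API. All inputs
# here are placeholders -- substitute your own volume, prices, and wattage.

def breakeven_months(hardware_cost: float, tokens_per_day: float,
                     api_price_per_1m: float, gpu_watts: float,
                     hours_per_day: float, price_per_kwh: float = 0.16) -> float:
    monthly_api_bill = tokens_per_day * 30 / 1e6 * api_price_per_1m
    monthly_power = gpu_watts / 1000 * hours_per_day * 30 * price_per_kwh
    savings = monthly_api_bill - monthly_power
    return float("inf") if savings <= 0 else hardware_cost / savings

# Example: $1,500 RTX 4090 displacing a $7.50/1M-token API at 2M tokens/day
# of batched traffic, drawing ~350W for 8 hours a day:
print(f"{breakeven_months(1500, 2_000_000, 7.50, 350, 8):.1f} months")  # ~3.4
```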
The Used GPU Market
The used market is where serious value lives for LLM enthusiasts. As crypto mining declined and data centers refreshed hardware, capable GPUs became available at steep discounts. Here's what to look for in mid-2026:
| GPU | VRAM | Used Price (2026) | New Price | Savings | Risk Assessment |
|---|---|---|---|---|---|
| RTX 3090 24GB | 24 GB | $550-700 | $1,500 (launch) | 60% | Medium -- check for mining wear |
| RTX 4090 24GB | 24 GB | $1,100-1,300 | $1,600 | 25% | Low -- still current gen |
| A100 40GB | 40 GB | $3,500-4,500 | $10,000+ | 60% | Low -- data center grade |
| A100 80GB | 80 GB | $7,000-9,000 | $15,000+ | 45% | Low -- check warranty status |
| Tesla P40 24GB | 24 GB | $200-280 | N/A | N/A | High -- old arch, no FP16 tensor cores |
Watch out: The Tesla P40, while temptingly cheap at $200 for 24 GB of VRAM, lacks tensor cores and has severely limited FP16 throughput. It runs at roughly one-third the speed of an RTX 3090 for LLM inference despite having the same VRAM capacity. The RTX 3090 at $550-700 is almost always the better value on the used market.
Buying Guide: Recommendations by Budget Tier
Under $300: RTX 4060 8GB or Used Tesla P40
At this budget, the RTX 4060 is the best new option. It runs 7-8B models at Q4 with room for reasonable context lengths. You can experiment with llama.cpp, Ollama, and local chatbots. The used Tesla P40 ($200-280) offers 24 GB of VRAM, which fits larger models, but its ancient architecture makes inference painfully slow. Get the 4060 unless you specifically need to load 13B+ models and can tolerate 15-20 t/s.
Under $800: RTX 4070 Ti Super 16GB or Used RTX 3090 24GB
This is the sweet spot tier. The RTX 4070 Ti Super at $700 runs 13-14B models comfortably at Q4_K_M with 16K+ context. A used RTX 3090 at $550-700 gives you 24 GB -- enough for 32-34B models at Q4. The 3090 is slower per token than the 4070 Ti Super on models that fit in both, but it wins on models that only fit in 24 GB. If you want to run the largest models that fit on a single consumer card, get the used 3090. For faster inference on 13B and below, get the 4070 Ti Super.
Under $2,000: RTX 4090 24GB or RTX 5090 32GB
The RTX 4090 at $1,500 (new or lightly used) is the most popular enthusiast LLM card. It handles 32B models at Q4_K_M comfortably; Q5_K_M on a 32-34B model needs about 24 GB for the weights alone, so it won't leave room on a 24 GB card. The RTX 5090 at $2,000 adds 8 GB more VRAM and 78% more memory bandwidth, which translates to meaningfully faster generation on every model. If buying new, the 5090 is worth the $500 premium. If buying used, the 4090 at $1,100 is exceptional value. Either way, this tier runs everything up to 32-34B at high quality with fast generation.
$2,000+: Multi-GPU or Data Center Cards
Once you're above $2,000, you're looking at 70B+ models. Options: dual RTX 4090 ($3,000 for GPUs, ~$4,000 total system), a used A100 80GB ($7,000-9,000), or the RTX 6000 Ada 48GB ($5,800). The dual 4090 offers the best performance per dollar for 70B models on a consumer platform. The A100 is simpler (single card, no tensor parallelism overhead) and gives 80 GB with massive bandwidth. The RTX 6000 Ada sits awkwardly between -- 48 GB is tight for 70B, and its bandwidth is lower than the A100. At this tier, also consider the M4 Ultra Mac Studio ($5,600) for a silent, power-efficient setup that trades raw speed for convenience.
Frequently Asked Questions
Can I run a 70B model on a single RTX 4090?
No. A 70B model at Q4_K_M requires approximately 42 GB of VRAM for the weights alone, plus additional space for the KV cache. The RTX 4090 has 24 GB. You can partially offload layers to system RAM using the -ngl flag in llama.cpp (see the sketch below), but throughput drops to roughly 8-12 t/s because the layers left in system RAM run at CPU memory-bandwidth speeds rather than VRAM speeds. For usable 70B performance, you need either dual 4090s with tensor parallelism, a single 48GB+ card, or an Apple Silicon system with 64+ GB unified memory.
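If you want to try partial offload anyway, the knob is the number of layers kept on the GPU: -ngl on the llama.cpp CLI, n_gpu_layers in the Python bindings. A minimal sketch; the model path and layer count are placeholders you'd tune for your own card:

```python
from llama_cpp import Llama

# Partial offload: keep as many layers on the GPU as fit, leave the rest on
# the CPU. Expect single-digit tokens/sec for 70B on a 24 GB card -- the
# CPU-resident layers are limited by system-RAM bandwidth on every token.
llm = Llama(
    model_path="llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder, ~42 GB
    n_gpu_layers=35,   # placeholder: raise until VRAM is nearly full
    n_ctx=4096,
)
out = llm("Why is partial offload slow?", max_tokens=128)
print(out["choices"][0]["text"])
```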
Is the RTX 5090 worth it over the RTX 4090 for LLMs?
Yes, if you're buying new. The 5090 offers 32 GB VRAM (vs 24 GB), 1,792 GB/s bandwidth (vs 1,008 GB/s), and faster tensor cores. This translates to roughly 30-35% faster token generation on models that fit on both cards, and the ability to run 32-34B models at Q5_K_M or Q6_K quality that won't fit on the 4090. The $500 premium is justified. However, if the used 4090 market offers cards at $1,100, the value calculus shifts -- you could buy two used 4090s for the price of one new 5090 and run 70B models via tensor parallelism.
How does AMD compare to NVIDIA for LLM inference?
AMD's ROCm software stack has improved significantly, but it still trails CUDA in ecosystem maturity. The RX 7900 XTX (24 GB, $900) and Instinct MI300X (192 GB HBM3, data center) are capable hardware. llama.cpp supports ROCm, and vLLM has experimental AMD support. In practice, expect 10-20% lower performance than equivalent NVIDIA hardware due to less optimized kernels. The MI300X is genuinely competitive with H100 at the data center level, but for consumer cards, NVIDIA remains the safer choice because of software compatibility.
Should I buy one expensive GPU or two cheaper ones?
One GPU is almost always better if the model fits. Tensor parallelism over PCIe introduces latency per token (each layer requires inter-GPU communication), and the software stack is more complex to configure. Two RTX 4090s reach about 81% of A100 throughput on 70B models despite having comparable total bandwidth, because PCIe interconnect is the bottleneck. Buy two GPUs only when no single affordable GPU has enough VRAM for your target model.
What about used A100s -- are they reliable?
Data center GPUs are designed for continuous operation and typically have more robust power delivery and cooling than consumer cards. A used A100 that ran in a well-cooled data center for two years is likely in better shape than a used RTX 3090 that was mined on at 100% utilization with inadequate airflow. Check the serial number with NVIDIA for warranty status. The 80 GB variant is strongly preferred over the 40 GB for LLM work -- it costs roughly twice as much on the used market, but the doubled VRAM capacity is transformative for which models you can run.
How much electricity does local LLM inference cost?
An RTX 4090 under LLM inference load draws approximately 300-350W (below its 450W TDP because inference is less demanding than training). At the US average of $0.16/kWh, running 8 hours daily costs about $13-15/month. A dual 4090 setup doubles that. An H100 at 700W costs about $27/month at 8 hours daily. Apple Silicon is dramatically more efficient -- an M4 Ultra under full load draws about 90W total system power, costing roughly $3.50/month. Power costs are rarely the deciding factor, but they matter for 24/7 serving scenarios.
When should I just use an API instead?
Use an API when: your token volume is under 50K tokens per day (the hardware investment won't pay back quickly), you need access to frontier models like GPT-4o or Claude Opus that can't be self-hosted, latency requirements are loose (APIs add network round-trip time), or you lack the technical inclination to manage local infrastructure. Use local GPUs when: you need data privacy, you process 100K+ tokens daily, you want sub-100ms time-to-first-token, or you're iterating rapidly on prompts and fine-tuning during development.
The Bottom Line
For most practitioners entering local LLM inference in 2026, the decision tree is straightforward. If your budget is under $800, a used RTX 3090 or new RTX 4070 Ti Super unlocks 13-32B models at interactive speeds. If you can spend $1,500-2,000, the RTX 4090 or 5090 handles everything up to 34B at excellent quality. For 70B+, you're looking at multi-GPU setups, data center cards, or Apple Silicon with high unified memory. Match the GPU to the model sizes you actually need -- not aspirational ones -- and factor in quantization. Q4_K_M on a card with enough VRAM beats Q8_0 spilling to system RAM every time.
Written by
Abhishek Patel
Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.