
Best GPU for Running LLMs Locally (2026): VRAM, Performance & Value Guide

VRAM requirements, tokens/sec benchmarks, and cost-per-token analysis for every GPU worth considering in 2026. Covers RTX 4060 through H100, multi-GPU setups, Apple Silicon, and budget tiers from $260 to $22,000.

Abhishek Patel -- 16 min read


Choosing a GPU for Local LLMs Is Harder Than It Looks

The GPU you pick for running large language models locally is the single biggest determinant of what models you can run, how fast they generate, and whether the investment makes financial sense compared to API calls. VRAM is the hard constraint -- if a model doesn't fit, it doesn't run (or it spills to system RAM and crawls). But VRAM alone doesn't tell the full story. Memory bandwidth, tensor core generation, and power draw all shape the real-world experience.

I've benchmarked seven GPUs across multiple model sizes and quantization levels to produce the numbers in this guide. Everything here is based on actual measured performance using llama.cpp and vLLM, not manufacturer marketing. If you're deciding what to buy in 2026, this is the data you need.

Why VRAM Is the Gating Factor

Definition: VRAM (Video Random Access Memory) is the dedicated high-bandwidth memory on a GPU. For LLM inference, the entire model weights plus the KV cache (which stores attention state for each token in the context) must reside in VRAM for full-speed generation. When model data exceeds VRAM, it spills to system RAM over the PCIe bus, reducing throughput by 10-30x.

LLM inference is almost entirely memory-bandwidth-bound during the decode phase (token-by-token generation). The GPU reads the full model weights from VRAM for every single token it generates. A model that barely fits in VRAM leaves no room for the KV cache, which means shorter context windows or immediate spilling. Always budget 1-3 GB of headroom beyond the model weight file size.
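Because decode is bandwidth-bound, you can put a hard ceiling on tokens/sec with one division: memory bandwidth over model size. A minimal sketch (function name is mine; real throughput lands below this ceiling due to compute, KV-cache reads, and framework overhead):

```python
def max_decode_tps(bandwidth_gb_s: float, weight_file_gb: float) -> float:
    """Upper bound on decode tokens/sec for a bandwidth-bound model:
    every generated token streams the full weights from VRAM once."""
    return bandwidth_gb_s / weight_file_gb

# RTX 4090 (1,008 GB/s) running an 8B model at Q4_K_M (4.9 GB):
ceiling = max_decode_tps(1008, 4.9)
print(round(ceiling))  # ~206 t/s ceiling; measured speed is lower (~118 t/s)
```

This is why a card's bandwidth spec predicts LLM decode speed far better than its TFLOPS figure.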

VRAM Requirements by Model Size and Quantization

Quantization compresses model weights from 16-bit floats to lower precision, trading minimal quality loss for dramatically reduced VRAM usage. The table below shows the weight file size for common model sizes across quantization levels. Add 1-2 GB for KV cache overhead at 4K context, or 4-8 GB at 32K context.

| Model Size | FP16 | Q8_0 | Q6_K | Q5_K_M | Q4_K_M |
|---|---|---|---|---|---|
| 7-8B | 16 GB | 8.5 GB | 6.6 GB | 5.7 GB | 4.9 GB |
| 13-14B | 28 GB | 14.8 GB | 11.5 GB | 9.9 GB | 8.5 GB |
| 32-34B | 68 GB | 36 GB | 28 GB | 24 GB | 20.5 GB |
| 70-72B | 140 GB | 74 GB | 57 GB | 49 GB | 42 GB |
| 120-140B (e.g. Qwen 3 235B MoE active) | 240+ GB | 127 GB | 98 GB | 84 GB | 72 GB |
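The table rows all follow the same arithmetic: weight size is parameter count times bits per weight divided by eight, plus KV-cache headroom. A small estimator (function name and the default 2 GB headroom are my assumptions, matching the 4K-context guidance above):

```python
def est_vram_gb(params_billion: float, bits_per_weight: float,
                kv_cache_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weights (params x bits/8) plus KV-cache headroom."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb

# A 70B model at Q4_K_M (~4.8 bits/weight) with 4K context:
print(round(est_vram_gb(70, 4.8), 1))  # ~44.0 GB -> needs a 48 GB card
```

Raise `kv_cache_gb` to 4-8 GB when planning for 32K context.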

Quantization Quality Impact (Perplexity Benchmarks)

Perplexity measures how "surprised" a model is by text -- lower is better. These numbers are from Llama 3.1 70B evaluated on WikiText-2:

| Quantization | Bits/Weight | Perplexity | Delta vs FP16 | Verdict |
|---|---|---|---|---|
| FP16 (baseline) | 16.0 | 3.12 | -- | Reference quality |
| Q8_0 | 8.5 | 3.13 | +0.3% | Virtually lossless |
| Q6_K | 6.6 | 3.15 | +1.0% | Negligible loss |
| Q5_K_M | 5.7 | 3.19 | +2.2% | Minor loss, great balance |
| Q4_K_M | 4.8 | 3.28 | +5.1% | Noticeable on reasoning tasks |
| Q3_K_M | 3.9 | 3.52 | +12.8% | Visible degradation |

Pro tip: Q5_K_M is the sweet spot for most users. It preserves roughly 98% of model quality at about one-third the VRAM of FP16. Drop to Q4_K_M only when you need the extra headroom for longer context windows, or when a model doesn't quite fit at Q5_K_M.

GPUs Tested: Specifications

| GPU | VRAM | Bandwidth | FP16 TFLOPS | TDP | Street Price (2026) |
|---|---|---|---|---|---|
| RTX 4060 8GB | 8 GB GDDR6 | 272 GB/s | 15.1 | 115W | $260 |
| RTX 4070 Ti Super 16GB | 16 GB GDDR6X | 672 GB/s | 44.1 | 285W | $700 |
| RTX 4090 24GB | 24 GB GDDR6X | 1,008 GB/s | 82.6 | 450W | $1,500 |
| RTX 5090 32GB | 32 GB GDDR7 | 1,792 GB/s | 104.8 | 575W | $2,000 |
| RTX 6000 Ada 48GB | 48 GB GDDR6 | 960 GB/s | 91.1 | 300W | $5,800 |
| A100 80GB | 80 GB HBM2e | 2,039 GB/s | 312 | 300W | $8,500 (used) |
| H100 80GB | 80 GB HBM3 | 3,350 GB/s | 989 | 700W | $22,000 |

Tokens/sec Benchmarks

All benchmarks use llama.cpp with a 2048-token prompt and 256-token generation. Numbers represent decode speed (tokens per second during generation). Tested with Llama 3.1 models at Q4_K_M quantization unless noted.
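Runs like these can be reproduced with llama.cpp's bundled `llama-bench` tool. A sketch (the model filename is a placeholder; `-ngl 99` offloads all layers to the GPU):

```shell
# Benchmark prompt processing (2048 tokens) and generation (256 tokens)
# with every layer offloaded to the GPU.
./llama-bench -m llama-3.1-8b-q4_k_m.gguf -p 2048 -n 256 -ngl 99
```

llama-bench reports prompt-processing and generation speed separately; the tables below quote the generation (decode) figure.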

8B Model (Llama 3.1 8B Q4_K_M -- 4.9 GB)

| GPU | Tokens/sec | TTFT | VRAM Used | Notes |
|---|---|---|---|---|
| RTX 4060 8GB | 52 t/s | 0.18s | 5.8 GB | Fits with room for 8K context |
| RTX 4070 Ti Super | 89 t/s | 0.09s | 5.8 GB | Plenty of headroom |
| RTX 4090 | 118 t/s | 0.06s | 5.8 GB | Overkill for 8B |
| RTX 5090 | 156 t/s | 0.04s | 5.8 GB | Bandwidth advantage shows |
| A100 80GB | 142 t/s | 0.05s | 5.8 GB | Older tensor cores |
| H100 80GB | 195 t/s | 0.03s | 5.8 GB | Fastest single-GPU |

70B Model (Llama 3.1 70B Q4_K_M -- 42 GB)

| GPU | Tokens/sec | TTFT | VRAM Used | Notes |
|---|---|---|---|---|
| RTX 4060 8GB | -- | -- | -- | Does not fit |
| RTX 4070 Ti Super | -- | -- | -- | Does not fit |
| RTX 4090 | -- | -- | -- | Does not fit (24 GB) |
| RTX 5090 | -- | -- | -- | Does not fit (32 GB) |
| RTX 6000 Ada 48GB | 18 t/s | 1.8s | 44 GB | Tight fit, short context only |
| A100 80GB | 32 t/s | 0.9s | 44 GB | Comfortable with 32K context |
| H100 80GB | 51 t/s | 0.5s | 44 GB | Production-grade speed |
| 2x RTX 4090 (tensor parallel) | 26 t/s | 1.4s | 22 GB each | Split across PCIe -- see multi-GPU section |
| 2x RTX 5090 (tensor parallel) | 38 t/s | 0.8s | 21 GB each | Better bandwidth helps |

13-14B Model (Q4_K_M -- 8.5 GB)

| GPU | Tokens/sec | TTFT | VRAM Used |
|---|---|---|---|
| RTX 4060 8GB | -- | -- | Does not fit (needs ~10 GB with KV cache) |
| RTX 4070 Ti Super | 62 t/s | 0.14s | 10.2 GB |
| RTX 4090 | 84 t/s | 0.08s | 10.2 GB |
| RTX 5090 | 112 t/s | 0.05s | 10.2 GB |
| A100 80GB | 98 t/s | 0.06s | 10.2 GB |
| H100 80GB | 138 t/s | 0.04s | 10.2 GB |

Watch out: VRAM reported by nvidia-smi includes driver overhead and framework allocations. A 24 GB card typically has 23.5 GB usable. Always test with your exact model file and context length before committing to a purchase.
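To see what is actually allocated on your card, nvidia-smi's query mode is handy:

```shell
# Report per-GPU memory in use vs. total, including driver overhead.
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```

Run it while your model is loaded with your target context length; the `memory.used` figure is the number that has to stay under the card's capacity.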

Multi-GPU: Dual RTX 4090 Tensor Parallelism

When a model doesn't fit on a single card, tensor parallelism splits the model across multiple GPUs. Each GPU holds a portion of every layer, and they communicate intermediate results during each forward pass. This works well but has caveats.

Using vLLM with tensor parallelism on 2x RTX 4090 (48 GB total VRAM):

| Model | Quant | Tokens/sec | vs Single A100 | Notes |
|---|---|---|---|---|
| Llama 3.1 70B | Q4_K_M | 26 t/s | 81% | PCIe 4.0 x16 interconnect is the bottleneck |
| Llama 3.1 70B | Q5_K_M | -- | -- | Does not fit (49 GB > 48 GB usable) |
| Qwen 2.5 32B | Q4_K_M | 48 t/s | N/A | Fits on single 4090, TP unnecessary |
| Mixtral 8x7B (47B total) | Q4_K_M | 35 t/s | N/A | MoE architecture benefits from split |

The dual 4090 setup costs around $3,000 for the GPUs alone, plus a motherboard and PSU that can handle 900W of GPU power draw. An NVLink bridge is not available on consumer cards, so all inter-GPU communication uses PCIe, which caps at 32 GB/s per direction on PCIe 4.0. This is why the dual 4090 reaches only 81% of A100 throughput on 70B despite having comparable total bandwidth on paper.
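A dual-GPU launch along these lines can be sketched with vLLM's CLI. Treat this as a template, not a recipe: the model name is illustrative, and in practice you would point vLLM at a 4-bit AWQ or GPTQ checkpoint sized to fit in 48 GB total:

```shell
# Split the model across both GPUs with tensor parallelism.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 8192
```

`--tensor-parallel-size 2` shards every layer across the two cards; `--max-model-len` caps the context so the KV cache fits alongside the weights.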

Pro tip: If you're building a dual-GPU rig for LLMs, prioritize a motherboard with two PCIe 5.0 x16 slots. PCIe 5.0 doubles the interconnect bandwidth to 64 GB/s per direction, which narrows the gap with data center NVLink setups significantly.

Apple Silicon M-Series

Apple Silicon uses unified memory shared between CPU and GPU cores. This means the "VRAM" limit is your total system RAM -- a Mac Studio with 192 GB of unified memory can load a 70B FP16 model entirely. The trade-off is lower memory bandwidth compared to dedicated GPUs.

| Apple Silicon | Unified Memory | Bandwidth | 8B Q4 t/s | 70B Q4 t/s | System Price |
|---|---|---|---|---|---|
| M2 Max (32GB) | 32 GB | 400 GB/s | 38 t/s | -- | $2,500 |
| M3 Max (64GB) | 64 GB | 400 GB/s | 42 t/s | 5.8 t/s | $3,200 |
| M4 Max (128GB) | 128 GB | 546 GB/s | 55 t/s | 9.2 t/s | $4,000 |
| M4 Ultra (192GB) | 192 GB | 819 GB/s | 72 t/s | 14.5 t/s | $5,600 |

Apple Silicon is compelling when you need large VRAM capacity in a quiet, power-efficient form factor. The M4 Ultra at 192 GB runs 70B models at Q8_0 quality with room for 32K context -- something that would require an A100 or dual-GPU setup on the NVIDIA side. The downside is raw throughput: the M4 Ultra at 14.5 t/s on 70B Q4 is about half the speed of a single H100.

CPU-Only Inference with AVX-512

If you have no GPU budget, modern CPUs with AVX-512 support can run smaller models at usable speeds. This path makes sense for development, testing, and low-traffic internal tools -- not production serving.

| CPU | RAM Config | Bandwidth | 8B Q4 t/s | 14B Q4 t/s |
|---|---|---|---|---|
| AMD EPYC 9654 (96-core) | DDR5-4800 12-ch | 460 GB/s | 32 t/s | 22 t/s |
| Intel Xeon w9-3595X (60-core) | DDR5-5600 8-ch | 358 GB/s | 26 t/s | 17 t/s |
| AMD Ryzen 9 9950X (16-core) | DDR5-6000 2-ch | 96 GB/s | 19 t/s | 11 t/s |
| Intel Core i7-14700K (20-core) | DDR5-5600 2-ch | 89 GB/s | 15 t/s | 9 t/s |

The key insight: memory bandwidth per dollar is terrible on CPU compared to GPU. A Ryzen 9 system at $1,200 gives you 96 GB/s. An RTX 4060 at $260 gives you 272 GB/s. CPU-only makes sense when the model fits in system RAM but not in any GPU you own, or when you're in an environment where discrete GPUs are unavailable (VPS, shared servers, laptops without dGPU).
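The bandwidth-per-dollar gap is easy to quantify from the numbers above (function name is mine):

```python
def gb_s_per_dollar(bandwidth_gb_s: float, price_usd: float) -> float:
    """Memory bandwidth delivered per dollar of hardware."""
    return bandwidth_gb_s / price_usd

rtx4060 = gb_s_per_dollar(272, 260)   # ~1.05 GB/s per dollar
ryzen = gb_s_per_dollar(96, 1200)     # ~0.08 GB/s per dollar
print(round(rtx4060 / ryzen, 1))      # the GPU delivers ~13x more bandwidth per dollar
```

Since decode speed tracks bandwidth almost linearly, that 13x ratio is roughly the throughput-per-dollar gap too.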

Cost-per-Token vs API Break-Even Analysis

Local inference has a fixed hardware cost and ongoing electricity cost. API calls have zero upfront cost and a per-token price. The break-even depends on your monthly token volume.

| Setup | Hardware Cost | Monthly Power (24/7) | Tokens/sec | Electricity per 1M Tokens | Break-Even vs GPT-4o at 1M tokens/day |
|---|---|---|---|---|---|
| RTX 4090 (single, 32B model) | $1,500 | $35 | 48 t/s (32B) | ~$0.31 | ~7 months |
| 2x RTX 4090 (70B model) | $3,500 | $65 | 26 t/s | ~$1.11 | ~18 months |
| RTX 5090 (single, 32B model) | $2,000 | $42 | 65 t/s (32B) | ~$0.27 | ~9 months |
| A100 80GB (used) | $8,500 | $50 | 32 t/s | ~$0.42 | ~40 months |
| M4 Ultra Mac Studio | $5,600 | $12 | 14.5 t/s | ~$0.28 | ~26 months |
| Cloud API (Llama 70B via Together) | $0 | $0 | N/A | $0.90 (API price) | -- |
| Cloud API (GPT-4o) | $0 | $0 | N/A | $7.50 (blended API price) | -- |

Electricity figures assume $0.16/kWh and sustained inference draw (~330W for a 4090, ~650W for the dual setup, ~400W for a 5090, ~300W for the A100, ~90W for the M4 Ultra).

At 1M tokens per day replacing a frontier API like GPT-4o (about $225/month), a local RTX 4090 pays for itself in roughly seven months. Against commodity Llama hosting at $0.90 per million tokens, the math is far less favorable: local electricity alone costs $0.27-1.11 per million tokens, so the per-token savings are thin and payback stretches into years at 100K tokens per day. Below a few hundred thousand tokens per day, API calls are likely cheaper unless you value privacy, latency control, or offline capability.
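The break-even arithmetic is worth running for your own volume. A minimal calculator (function name and parameter choices are mine; it counts only hardware and electricity against a per-token API price):

```python
def breakeven_months(hw_cost: float, watts: float, tps: float,
                     tokens_per_day: float, api_usd_per_1m: float,
                     usd_per_kwh: float = 0.16) -> float:
    """Months until local hardware pays for itself vs. a per-token API."""
    hours_per_day = tokens_per_day / tps / 3600          # GPU-hours of generation
    elec_per_day = hours_per_day * watts / 1000 * usd_per_kwh
    api_per_day = tokens_per_day / 1e6 * api_usd_per_1m
    saved_per_day = api_per_day - elec_per_day
    if saved_per_day <= 0:
        return float("inf")                              # never breaks even
    return hw_cost / saved_per_day / 30

# RTX 4090 (~330W under load, 48 t/s) replacing a $7.50/1M-token API
# at 1M tokens/day:
print(round(breakeven_months(1500, 330, 48, 1_000_000, 7.50), 1))  # -> ~7 months
```

Plug in $0.90/1M for commodity Llama hosting and the same card takes orders of magnitude longer to pay back, which is why the comparison API matters as much as the volume.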

The Used GPU Market

The used market is where serious value lives for LLM enthusiasts. As crypto mining declined and data centers refreshed hardware, capable GPUs became available at steep discounts. Here's what to look for in mid-2026:

| GPU | VRAM | Used Price (2026) | New Price | Savings | Risk Assessment |
|---|---|---|---|---|---|
| RTX 3090 24GB | 24 GB | $550-700 | $1,500 (launch) | 60% | Medium -- check for mining wear |
| RTX 4090 24GB | 24 GB | $1,100-1,300 | $1,600 | 25% | Low -- still current gen |
| A100 40GB | 40 GB | $3,500-4,500 | $10,000+ | 60% | Low -- data center grade |
| A100 80GB | 80 GB | $7,000-9,000 | $15,000+ | 45% | Low -- check warranty status |
| Tesla P40 24GB | 24 GB | $200-280 | N/A | N/A | High -- old arch, no FP16 tensor cores |
Watch out: The Tesla P40, while temptingly cheap at $200 for 24 GB of VRAM, lacks FP16 tensor cores. It runs at roughly one-third the speed of an RTX 3090 for LLM inference despite having the same VRAM capacity. The RTX 3090 at $550-700 is almost always the better value on the used market.

Buying Guide: Recommendations by Budget Tier

Under $300: RTX 4060 8GB or Used Tesla P40

At this budget, the RTX 4060 is the best new option. It runs 7-8B models at Q4 with room for reasonable context lengths. You can experiment with llama.cpp, Ollama, and local chatbots. The used Tesla P40 ($200-280) offers 24 GB of VRAM, which fits larger models, but its ancient architecture makes inference painfully slow. Get the 4060 unless you specifically need to load 13B+ models and can tolerate 15-20 t/s.

Under $800: RTX 4070 Ti Super 16GB or Used RTX 3090 24GB

This is the sweet spot tier. The RTX 4070 Ti Super at $700 runs 13-14B models comfortably at Q4_K_M with 16K+ context. A used RTX 3090 at $550-700 gives you 24 GB -- enough for 32-34B models at Q4. The 3090 is slower per token than the 4070 Ti Super on models that fit in both, but it wins on models that only fit in 24 GB. If you want to run the largest models that fit on a single consumer card, get the used 3090. For faster inference on 13B and below, get the 4070 Ti Super.

Under $2,000: RTX 4090 24GB or RTX 5090 32GB

The RTX 4090 at $1,500 (new or lightly used) is the most popular enthusiast LLM card. It handles 32B models at Q4 comfortably, and 34B models at Q5_K_M. The RTX 5090 at $2,000 adds 8 GB more VRAM and 78% more memory bandwidth, which translates to meaningfully faster generation on every model. If buying new, the 5090 is worth the $500 premium. If buying used, the 4090 at $1,100 is exceptional value. Either way, this tier runs everything up to 32-34B at high quality with fast generation.

$2,000+: Multi-GPU or Data Center Cards

Once you're above $2,000, you're looking at 70B+ models. Options: dual RTX 4090 ($3,000 for GPUs, ~$4,000 total system), a used A100 80GB ($7,000-9,000), or the RTX 6000 Ada 48GB ($5,800). The dual 4090 offers the best performance per dollar for 70B models on a consumer platform. The A100 is simpler (single card, no tensor parallelism overhead) and gives 80 GB with massive bandwidth. The RTX 6000 Ada sits awkwardly between -- 48 GB is tight for 70B, and its bandwidth is lower than the A100. At this tier, also consider the M4 Ultra Mac Studio ($5,600) for a silent, power-efficient setup that trades raw speed for convenience.

Frequently Asked Questions

Can I run a 70B model on a single RTX 4090?

No. A 70B model at Q4_K_M requires approximately 42 GB of VRAM for the weights alone, plus additional space for the KV cache. The RTX 4090 has 24 GB. You can partially offload layers to system RAM using the -ngl flag in llama.cpp, but this drops throughput from 100+ t/s (fully on GPU) to 8-12 t/s because of the PCIe bandwidth bottleneck. For usable 70B performance, you need either dual 4090s with tensor parallelism, a single 48GB+ card, or an Apple Silicon system with 64+ GB unified memory.
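The partial-offload fallback looks like this with llama.cpp (the filename and layer count are placeholders; `-ngl` sets how many layers go to the GPU, and the rest run from system RAM):

```shell
# Offload as many layers as fit in 24 GB and keep the remainder on CPU.
# Expect single-digit t/s once a large fraction of weights sit in system RAM.
./llama-cli -m llama-3.1-70b-q4_k_m.gguf -ngl 40 -c 4096 -p "Hello"
```

Useful for occasional queries against a big model, not for interactive work.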

Is the RTX 5090 worth it over the RTX 4090 for LLMs?

Yes, if you're buying new. The 5090 offers 32 GB VRAM (vs 24 GB), 1,792 GB/s bandwidth (vs 1,008 GB/s), and faster tensor cores. This translates to roughly 30-35% faster token generation on models that fit on both cards, and the ability to run 32-34B models at Q5_K_M or Q6_K quality that won't fit on the 4090. The $500 premium is justified. However, if the used 4090 market offers cards at $1,100, the value calculus shifts -- you could buy two used 4090s for the price of one new 5090 and run 70B models via tensor parallelism.

How does AMD compare to NVIDIA for LLM inference?

AMD's ROCm software stack has improved significantly, but it still trails CUDA in ecosystem maturity. The RX 7900 XTX (24 GB, $900) and Instinct MI300X (192 GB HBM3, data center) are capable hardware. llama.cpp supports ROCm, and vLLM has experimental AMD support. In practice, expect 10-20% lower performance than equivalent NVIDIA hardware due to less optimized kernels. The MI300X is genuinely competitive with H100 at the data center level, but for consumer cards, NVIDIA remains the safer choice because of software compatibility.

Should I buy one expensive GPU or two cheaper ones?

One GPU is almost always better if the model fits. Tensor parallelism over PCIe introduces latency per token (each layer requires inter-GPU communication), and the software stack is more complex to configure. Two RTX 4090s reach about 81% of A100 throughput on 70B models despite having comparable total bandwidth, because PCIe interconnect is the bottleneck. Buy two GPUs only when no single affordable GPU has enough VRAM for your target model.

What about used A100s -- are they reliable?

Data center GPUs are designed for continuous operation and typically have more robust power delivery and cooling than consumer cards. A used A100 that ran in a well-cooled data center for two years is likely in better shape than a used RTX 3090 that was mined on at 100% utilization with inadequate airflow. Check the serial number with NVIDIA for warranty status. The 80 GB variant is strongly preferred over the 40 GB for LLM work -- the price difference is 50-60%, but the doubled VRAM capacity is transformative for which models you can run.

How much electricity does local LLM inference cost?

An RTX 4090 under LLM inference load draws approximately 300-350W (below its 450W TDP because inference is less demanding than training). At the US average of $0.16/kWh, running 8 hours daily costs about $13-15/month. A dual 4090 setup doubles that. An H100 at 700W costs about $27/month at 8 hours daily. Apple Silicon is dramatically more efficient -- an M4 Ultra under full load draws about 90W total system power, costing roughly $3.50/month. Power costs are rarely the deciding factor, but they matter for 24/7 serving scenarios.

When should I just use an API instead?

Use an API when: your token volume is under 50K tokens per day (the hardware investment won't pay back quickly), you need access to frontier models like GPT-4o or Claude Opus that can't be self-hosted, latency requirements are loose (APIs add network round-trip time), or you lack the technical inclination to manage local infrastructure. Use local GPUs when: you need data privacy, you process 100K+ tokens daily, you want sub-100ms time-to-first-token, or you're iterating rapidly on prompts and fine-tuning during development.

The Bottom Line

For most practitioners entering local LLM inference in 2026, the decision tree is straightforward. If your budget is under $800, a used RTX 3090 or new RTX 4070 Ti Super unlocks 13-32B models at interactive speeds. If you can spend $1,500-2,000, the RTX 4090 or 5090 handles everything up to 34B at excellent quality. For 70B+, you're looking at multi-GPU setups, data center cards, or Apple Silicon with high unified memory. Match the GPU to the model sizes you actually need -- not aspirational ones -- and factor in quantization. Q4_K_M on a card with enough VRAM beats Q8_0 spilling to system RAM every time.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
