Self-hosted GPU comparisons
Cost per token for running LLMs locally, across RTX 4090, 5090, 5080, and 20+ other GPUs. When self-hosting becomes the cheaper answer.
Why self-hosted?
Running LLMs on your own hardware gives you control over your data, predictable costs, and no per-token API bills. The trade-off is the upfront hardware cost and the ongoing electricity bill. This guide compares GPUs by cost per million tokens, so you can see which card fits your volume and budget.
Assumptions
All figures assume 24/7 operation for one year, UK electricity at 26.35 p/kWh, and total system draw of GPU TDP plus 100 W overhead. "Full cost" charges the whole hardware price to year 1; "3yr amort" spreads the hardware price over three years.
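To make the maths reproducible, here is a minimal sketch of that cost model. The constants and function name are illustrative, not taken from the benchmark source:

```python
# Minimal sketch of the cost model above. Constants and names are illustrative.

HOURS_PER_YEAR = 24 * 365           # 24/7 operation for one year
ELEC_GBP_PER_KWH = 0.2635           # UK electricity at 26.35 p/kWh
SYSTEM_OVERHEAD_W = 100             # non-GPU system draw

def cost_per_million_tokens(hardware_gbp: float, gpu_tdp_w: float,
                            tokens_per_sec: float,
                            amortise_years: int | None = None) -> float:
    """£ per 1M tokens for a card run 24/7 over one year.

    "Full cost" charges the whole hardware price to year 1; pass
    amortise_years=3 to spread the hardware over three years instead.
    """
    draw_kw = (gpu_tdp_w + SYSTEM_OVERHEAD_W) / 1000
    power_gbp = draw_kw * HOURS_PER_YEAR * ELEC_GBP_PER_KWH
    hw_gbp = hardware_gbp / amortise_years if amortise_years else hardware_gbp
    tokens_per_year = tokens_per_sec * 3600 * HOURS_PER_YEAR
    return (hw_gbp + power_gbp) / (tokens_per_year / 1_000_000)
```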
RTX 4090: lab reference
Our lab reference card is an NVIDIA GeForce RTX 4090 (24 GB VRAM, 450 W TDP) on driver 590.48.01 / CUDA 13.1. At £1,800 RRP and 550 W total system draw, it costs roughly £1,269/year in electricity.
| Tier | Memory | Tok/s | Year 1 total | £/1M (full) | £/1M (3yr amort) |
|---|---|---|---|---|---|
| Large (70B) | 24 GB | — | — | Cannot run (insufficient VRAM) | — |
| Medium (14B) | 24 GB | 69 | £3,069 | £1.41 | £0.86 |
| Small (8B) | 24 GB | 104 | £3,069 | £0.94 | £0.57 |
| Optimised (8B) | 24 GB | ~130 | £3,069 | £0.75 | £0.46 |
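Feeding the RTX 4090's figures (£1,800 RRP, 450 W TDP) into the cost_per_million_tokens sketch from the assumptions section reproduces the table values to within a penny:

```python
# RTX 4090 reference: £1,800 RRP, 450 W TDP; throughputs from the table above.
# Uses cost_per_million_tokens() from the sketch in the assumptions section.
for tier, tok_s in [("Medium (14B)", 69), ("Small (8B)", 104), ("Optimised (8B)", 130)]:
    full = cost_per_million_tokens(1800, 450, tok_s)
    amort = cost_per_million_tokens(1800, 450, tok_s, amortise_years=3)
    print(f"{tier}: £{full:.2f} full, £{amort:.2f} 3yr amortised per 1M tokens")
# Medium (14B): £1.41 full, £0.86 3yr amortised per 1M tokens
# Small (8B): £0.94 full, £0.57 3yr amortised per 1M tokens
# Optimised (8B): £0.75 full, £0.46 3yr amortised per 1M tokens
```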
Best value by model size
- 8B models: RTX 5070 Ti (£0.42/1M amortised) or RTX 5080 (£0.47/1M).
- 14B models: RTX 5070 Ti (£0.64/1M) or RTX 5080 (£0.69/1M).
- 70B models: RTX PRO 6000 WS (£4.01/1M amortised); the RTX 4090 48GB is cheaper per token (£3.68/1M) but much slower (18 vs 28 tok/s). Only cards with 40GB+ VRAM can run 70B.
Full GPU comparison: 8B models
Qwen3 8B at 16K context, 4-bit. Sorted by 3yr amortised cost.
| GPU | Memory | Tok/s | £/1M (full) | £/1M (3yr) |
|---|---|---|---|---|
| RTX 5070 Ti | 16 GB | 88 | £0.60 | £0.42 |
| RTX 5080 | 16 GB | 94 | £0.69 | £0.47 |
| RTX 5060 Ti | 16 GB | 51 | £0.65 | £0.48 |
| RTX 3080 Ti | 12 GB | 88 | £0.72 | £0.49 |
| RTX 5090 | 32 GB | 145 | £0.78 | £0.49 |
| RTX 3080 10GB | 10 GB | 74 | £0.66 | £0.50 |
| RTX 3090 Ti | 24 GB | 94 | £0.86 | £0.52 |
| RTX 3090 | 24 GB | 87 | £0.82 | £0.52 |
| RTX 3060 | 12 GB | 42 | £0.66 | £0.53 |
| RTX 4090 | 24 GB | 104 | £0.94 | £0.57 |
| RTX 4090 48GB | 48 GB | 106 | £1.11 | £0.62 |
| RTX 6000 Ada | 48 GB | 99 | £1.74 | £0.78 |
| RTX PRO 6000 WS | 96 GB | 141 | £1.98 | £0.80 |
70B models: 40GB+ VRAM only
| GPU | Memory | Tok/s | £/1M (full) | £/1M (3yr) |
|---|---|---|---|---|
| RTX PRO 6000 WS | 96 GB | 28 | £9.95 | £4.01 |
| RTX 4090 48GB | 48 GB | 18 | £6.55 | £3.68 |
| RTX 6000 Ada | 48 GB | 14 | £12.28 | £5.49 |
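The 40GB+ cut-off is easy to sanity-check from the weights alone. A rough sketch follows; the 4.5 bits-per-weight figure is an assumed approximation for 4-bit quantisation plus scale/zero-point overhead, and KV cache and runtime overhead add several GB on top:

```python
# Back-of-the-envelope VRAM for a 4-bit quantised 70B model (weights only).
params = 70e9
bits_per_weight = 4.5        # assumed: ~4-bit quant plus scale/zero-point overhead
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")   # ~39 GB, before KV cache and overhead
```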
Cloud vs self-hosted
Cloud APIs (GPT-5 mini, Gemini Flash-Lite) run roughly £0.10–0.50 per 1M tokens. A self-hosted card sustaining 9,000+ tokens per minute (TPM) lands around £0.40–0.50/1M, which is competitive for high-volume, batch-friendly workloads. Large 70B models, at roughly 2,400 TPM, are often more expensive per token than cloud.
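For a sense of scale, here is a sketch of what sustained 9,000 TPM means in monthly volume and spend at the per-token rates above; the £0.45 self-hosted figure is simply the midpoint of the quoted range, not a measured value:

```python
# What sustained 9,000 tokens/minute means per month, priced at the rates above.
tpm = 9_000
tokens_per_month = tpm * 60 * 24 * 30                 # ~389M tokens
for label, gbp_per_1m in [("Cloud API (low end)", 0.10),
                          ("Cloud API (high end)", 0.50),
                          ("Self-hosted (midpoint)", 0.45)]:
    monthly = tokens_per_month / 1e6 * gbp_per_1m
    print(f"{label}: £{monthly:,.0f}/month")
```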
The takeaway
For small models (8B), best value is the RTX 5070 Ti or RTX 5080, and the same two cards lead for 14B. For 70B it is the RTX PRO 6000 WS. The RTX 4090 sits mid-pack and remains a solid lab reference.
Sources: Hardware Corner GPU ranking, Cloudrift benchmarks. UK RRP from Scan, Which, bestvaluegpu.com. Last updated Feb 2026.