How to Self-Host Private AI Infrastructure — LLM, GPU, RAG Setup Guide

What you're building

A private LLM inference stack — open-weight models running on your GPU hardware, serving your team via an OpenAI-compatible API. Your data never leaves your environment.

This guide covers a production-ready setup for a team of 10–100 users with mixed workloads: document Q&A, writing assistance, code review, and summarization.

What you need before you start

Hardware A server with a modern NVIDIA GPU. Minimum viable: RTX 4090 (24GB VRAM) for 7B–13B models. Practical for teams: A100 40GB or A100 80GB.

VRAM is the constraint. Everything else is secondary.

Software prerequisites

Linux (Ubuntu 22.04 or AlmaLinux 9 recommended)
NVIDIA drivers + CUDA toolkit
Docker and Docker Compose
Basic familiarity with Linux CLI

Network The inference server needs to be reachable by your team. Private VLAN is ideal. HTTPS termination in front of the API endpoint — never expose raw vLLM/Ollama to the internet.

Model selection

This is the decision that matters most. Wrong model choice means poor results or hardware you can't afford.

Use case	Recommended model	Min VRAM
General assistant	Llama 3.1 8B	6GB (Q4)
Long documents	Qwen 2.5 14B	10GB (Q4)
Code assistance	Codestral 22B or DeepSeek Coder 33B	14GB (Q4)
High quality general	Llama 3.1 70B	42GB (Q4_K_M)
Best available	Llama 3.1 405B	~240GB (Q4)

For most teams: Qwen 2.5 32B at Q4 quantization on a single A100 80GB. Good enough for almost every business task, fast enough for interactive use, fits comfortably with room for context.

Inference engine choice

Ollama — install in minutes, model management built in, works immediately. Use it for: evaluation, single-user setups, development. Weakness: lower throughput under concurrent load.

vLLM — production inference engine. PagedAttention means efficient memory use under load. OpenAI-compatible API out of the box. Use it for: team deployments, anything with more than 3–5 concurrent users.

For a team deployment, start with Ollama to validate your model choice, then migrate to vLLM for production.

Interface: Open WebUI

Open WebUI gives your team a ChatGPT-like interface connected to your Ollama or vLLM backend. Deploy via Docker:

docker run -d -p 3000:80   -e OLLAMA_BASE_URL=http://your-ollama-host:11434   -v open-webui:/app/backend/data   --name open-webui   ghcr.io/open-webui/open-webui:main

Put Caddy or Nginx in front with HTTPS. Connect your SSO (Keycloak/Authentik) via OIDC.

RAG: connecting your documents

RAG (Retrieval Augmented Generation) lets the AI answer questions based on your actual documents — contracts, manuals, case files — rather than just its training data.

Components:

Embedding model — converts documents to vectors. Run locally: nomic-embed-text via Ollama.
Vector database — stores and searches vectors. Qdrant is the best self-hosted option.
Orchestration — AnythingLLM handles this end-to-end with a decent UI. For more control: build with LangChain or LlamaIndex.

AnythingLLM is the fastest path to working RAG. Connect it to Ollama, point it at Qdrant, upload your documents.

AI Gateway: LiteLLM

If you want centralized key management, usage tracking, and the ability to route to external models as fallback — run LiteLLM in front of everything:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1
      api_base: http://ollama:11434

One endpoint. Virtual keys per team. Cost tracking. Fallback routing. Your developers use one API regardless of what's running behind it.

What breaks

Deliverability issues are rare but real — quantized models sometimes produce worse results on specific tasks than expected. Test with your actual use cases before committing to a model.

VRAM exhaustion — running too many concurrent requests on undersized hardware causes OOM errors. vLLM handles this more gracefully than Ollama via PagedAttention.

Context window — some tasks (long document summarization) require large context windows. 128k context at 70B uses significantly more VRAM than 8k context.

SSO integration — Open WebUI OIDC integration requires correct redirect URL configuration. Budget 2–3 hours if you haven't done this before.

Honest cost breakdown

Hardware (one-time)

RTX 4090 server: €4,000–6,000
A100 40GB server: €12,000–18,000
A100 80GB server: €20,000–30,000

Ongoing

Power: 300–500W continuous = €50–80/month at EU electricity rates
Maintenance: 2–4 hours/month for updates, monitoring, occasional incidents
Your time for initial setup: 20–40 hours including testing and tuning

Break-even vs. OpenAI API A team of 20 heavy users costs €800–1,500/month on OpenAI API. A well-sized private deployment pays for itself in 12–18 months on hardware alone — before counting the compliance and privacy value.

Or let PILOT run it

If you've read this and decided you'd rather have someone else manage the GPU, the models, the updates, and the 3am incidents — that's what we do.

Fixed monthly cost. EU jurisdiction. Your models, your data, our problem.

Request access →