What you're building
A private LLM inference stack — open-weight models running on your GPU hardware, serving your team via an OpenAI-compatible API. Your data never leaves your environment.
This guide covers a production-ready setup for a team of 10–100 users with mixed workloads: document Q&A, writing assistance, code review, and summarization.
What you need before you start
Hardware A server with a modern NVIDIA GPU. Minimum viable: RTX 4090 (24GB VRAM) for 7B–13B models. Practical for teams: A100 40GB or A100 80GB.
VRAM is the constraint. Everything else is secondary.
Software prerequisites
- Linux (Ubuntu 22.04 or AlmaLinux 9 recommended)
- NVIDIA drivers + CUDA toolkit
- Docker and Docker Compose
- Basic familiarity with Linux CLI
Network The inference server needs to be reachable by your team. Private VLAN is ideal. HTTPS termination in front of the API endpoint — never expose raw vLLM/Ollama to the internet.
Model selection
This is the decision that matters most. Wrong model choice means poor results or hardware you can't afford.
| Use case | Recommended model | Min VRAM |
|---|---|---|
| General assistant | Llama 3.1 8B | 6GB (Q4) |
| Long documents | Qwen 2.5 14B | 10GB (Q4) |
| Code assistance | Codestral 22B or DeepSeek Coder 33B | 14GB (Q4) |
| High quality general | Llama 3.1 70B | 42GB (Q4_K_M) |
| Best available | Llama 3.1 405B | ~240GB (Q4) |
For most teams: Qwen 2.5 32B at Q4 quantization on a single A100 80GB. Good enough for almost every business task, fast enough for interactive use, fits comfortably with room for context.
Inference engine choice
Ollama — install in minutes, model management built in, works immediately. Use it for: evaluation, single-user setups, development. Weakness: lower throughput under concurrent load.
vLLM — production inference engine. PagedAttention means efficient memory use under load. OpenAI-compatible API out of the box. Use it for: team deployments, anything with more than 3–5 concurrent users.
For a team deployment, start with Ollama to validate your model choice, then migrate to vLLM for production.
Interface: Open WebUI
Open WebUI gives your team a ChatGPT-like interface connected to your Ollama or vLLM backend. Deploy via Docker:
docker run -d -p 3000:80 -e OLLAMA_BASE_URL=http://your-ollama-host:11434 -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
Put Caddy or Nginx in front with HTTPS. Connect your SSO (Keycloak/Authentik) via OIDC.
RAG: connecting your documents
RAG (Retrieval Augmented Generation) lets the AI answer questions based on your actual documents — contracts, manuals, case files — rather than just its training data.
Components:
- Embedding model — converts documents to vectors. Run locally: nomic-embed-text via Ollama.
- Vector database — stores and searches vectors. Qdrant is the best self-hosted option.
- Orchestration — AnythingLLM handles this end-to-end with a decent UI. For more control: build with LangChain or LlamaIndex.
AnythingLLM is the fastest path to working RAG. Connect it to Ollama, point it at Qdrant, upload your documents.
AI Gateway: LiteLLM
If you want centralized key management, usage tracking, and the ability to route to external models as fallback — run LiteLLM in front of everything:
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: sk-...
- model_name: local-llama
litellm_params:
model: ollama/llama3.1
api_base: http://ollama:11434
One endpoint. Virtual keys per team. Cost tracking. Fallback routing. Your developers use one API regardless of what's running behind it.
What breaks
Deliverability issues are rare but real — quantized models sometimes produce worse results on specific tasks than expected. Test with your actual use cases before committing to a model.
VRAM exhaustion — running too many concurrent requests on undersized hardware causes OOM errors. vLLM handles this more gracefully than Ollama via PagedAttention.
Context window — some tasks (long document summarization) require large context windows. 128k context at 70B uses significantly more VRAM than 8k context.
SSO integration — Open WebUI OIDC integration requires correct redirect URL configuration. Budget 2–3 hours if you haven't done this before.
Honest cost breakdown
Hardware (one-time)
- RTX 4090 server: €4,000–6,000
- A100 40GB server: €12,000–18,000
- A100 80GB server: €20,000–30,000
Ongoing
- Power: 300–500W continuous = €50–80/month at EU electricity rates
- Maintenance: 2–4 hours/month for updates, monitoring, occasional incidents
- Your time for initial setup: 20–40 hours including testing and tuning
Break-even vs. OpenAI API A team of 20 heavy users costs €800–1,500/month on OpenAI API. A well-sized private deployment pays for itself in 12–18 months on hardware alone — before counting the compliance and privacy value.
Or let PILOT run it
If you've read this and decided you'd rather have someone else manage the GPU, the models, the updates, and the 3am incidents — that's what we do.
Fixed monthly cost. EU jurisdiction. Your models, your data, our problem.