PILOT Technology Solutions
  • Missions
  • Services
  • Pilot Book
  • About
  • Contact
PILOT BOOK GUIDES / OPERATIONAL NOTES

Pilot Book: Private AI Infrastructure

How to deploy private LLM inference. What it actually costs. What breaks.

What is inside

  • Setup guides and real-world implementation notes
  • Tradeoffs, costs, and deployment assumptions
  • Enough detail to build, not just admire

Contents

  • Overview
  • Requirements
  • Installation
  • Cost Analysis
  • Why Fly with Us

What you're building

A private LLM inference stack — open-weight models running on your GPU hardware, serving your team via an OpenAI-compatible API. Your data never leaves your environment.

This guide covers a production-ready setup for a team of 10–100 users with mixed workloads: document Q&A, writing assistance, code review, and summarization.


What you need before you start

Hardware A server with a modern NVIDIA GPU. Minimum viable: RTX 4090 (24GB VRAM) for 7B–13B models. Practical for teams: A100 40GB or A100 80GB.

VRAM is the constraint. Everything else is secondary.

Software prerequisites

  • Linux (Ubuntu 22.04 or AlmaLinux 9 recommended)
  • NVIDIA drivers + CUDA toolkit
  • Docker and Docker Compose
  • Basic familiarity with Linux CLI

Network The inference server needs to be reachable by your team. Private VLAN is ideal. HTTPS termination in front of the API endpoint — never expose raw vLLM/Ollama to the internet.


Model selection

This is the decision that matters most. Wrong model choice means poor results or hardware you can't afford.

Use case Recommended model Min VRAM
General assistant Llama 3.1 8B 6GB (Q4)
Long documents Qwen 2.5 14B 10GB (Q4)
Code assistance Codestral 22B or DeepSeek Coder 33B 14GB (Q4)
High quality general Llama 3.1 70B 42GB (Q4_K_M)
Best available Llama 3.1 405B ~240GB (Q4)

For most teams: Qwen 2.5 32B at Q4 quantization on a single A100 80GB. Good enough for almost every business task, fast enough for interactive use, fits comfortably with room for context.


Inference engine choice

Ollama — install in minutes, model management built in, works immediately. Use it for: evaluation, single-user setups, development. Weakness: lower throughput under concurrent load.

vLLM — production inference engine. PagedAttention means efficient memory use under load. OpenAI-compatible API out of the box. Use it for: team deployments, anything with more than 3–5 concurrent users.

For a team deployment, start with Ollama to validate your model choice, then migrate to vLLM for production.


Interface: Open WebUI

Open WebUI gives your team a ChatGPT-like interface connected to your Ollama or vLLM backend. Deploy via Docker:

docker run -d -p 3000:80   -e OLLAMA_BASE_URL=http://your-ollama-host:11434   -v open-webui:/app/backend/data   --name open-webui   ghcr.io/open-webui/open-webui:main

Put Caddy or Nginx in front with HTTPS. Connect your SSO (Keycloak/Authentik) via OIDC.


RAG: connecting your documents

RAG (Retrieval Augmented Generation) lets the AI answer questions based on your actual documents — contracts, manuals, case files — rather than just its training data.

Components:

  • Embedding model — converts documents to vectors. Run locally: nomic-embed-text via Ollama.
  • Vector database — stores and searches vectors. Qdrant is the best self-hosted option.
  • Orchestration — AnythingLLM handles this end-to-end with a decent UI. For more control: build with LangChain or LlamaIndex.

AnythingLLM is the fastest path to working RAG. Connect it to Ollama, point it at Qdrant, upload your documents.


AI Gateway: LiteLLM

If you want centralized key management, usage tracking, and the ability to route to external models as fallback — run LiteLLM in front of everything:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.1
      api_base: http://ollama:11434

One endpoint. Virtual keys per team. Cost tracking. Fallback routing. Your developers use one API regardless of what's running behind it.


What breaks

Deliverability issues are rare but real — quantized models sometimes produce worse results on specific tasks than expected. Test with your actual use cases before committing to a model.

VRAM exhaustion — running too many concurrent requests on undersized hardware causes OOM errors. vLLM handles this more gracefully than Ollama via PagedAttention.

Context window — some tasks (long document summarization) require large context windows. 128k context at 70B uses significantly more VRAM than 8k context.

SSO integration — Open WebUI OIDC integration requires correct redirect URL configuration. Budget 2–3 hours if you haven't done this before.


Honest cost breakdown

Hardware (one-time)

  • RTX 4090 server: €4,000–6,000
  • A100 40GB server: €12,000–18,000
  • A100 80GB server: €20,000–30,000

Ongoing

  • Power: 300–500W continuous = €50–80/month at EU electricity rates
  • Maintenance: 2–4 hours/month for updates, monitoring, occasional incidents
  • Your time for initial setup: 20–40 hours including testing and tuning

Break-even vs. OpenAI API A team of 20 heavy users costs €800–1,500/month on OpenAI API. A well-sized private deployment pays for itself in 12–18 months on hardware alone — before counting the compliance and privacy value.


Or let PILOT run it

If you've read this and decided you'd rather have someone else manage the GPU, the models, the updates, and the 3am incidents — that's what we do.

Fixed monthly cost. EU jurisdiction. Your models, your data, our problem.

Request access →

PILOT PM / OPERATIONS

Built for sovereign delivery, clear handoff, and repeatable deployments.

This site is structured to keep the brand, the content, and the operational layers visually aligned.

Company

About Contact Pilot Book

Services

Infrastructure Mail Cloud

Mission

Missions AI / ML Developer

Resources

Stack Tower GPU

© 2026 PILOT Technology Solutions. All rights reserved.

Selected for teams that need the work done, not just documented.