What you're building
A Proxmox VE cluster: two or three physical nodes running KVM virtual machines and LXC containers, with ZFS-backed storage on each node, high availability, automated backup, and OPNsense handling perimeter security.
This is the foundation that everything else in a sovereign infrastructure stack runs on. Get this right and every other service becomes straightforward to deploy.
Hardware requirements
Minimum viable cluster: 2 nodes
- 2× servers, each with: 32GB+ RAM, 8+ cores, 2× NVMe SSDs (storage), 1× SSD (OS), 2× network interfaces
- A managed switch with VLAN support
- A separate, small server or VM for Proxmox Backup Server (or use one of the nodes)
Recommended for production: 3 nodes
Three nodes allow proper quorum. With two nodes, a single failure causes the cluster to lose quorum and HA cannot automatically restart VMs. A third node (even a lightweight one) solves this.
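The quorum rule behind this recommendation is plain majority voting. A minimal sketch of the arithmetic, assuming one vote per node and no QDevice; the function names are illustrative only:

```shell
#!/bin/sh
# Votes needed for quorum when every node has one vote:
# a strict majority, i.e. floor(n/2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while the remainder still holds quorum.
failures_tolerated() {
    echo $(( $1 - $(quorum_needed "$1") ))
}

for nodes in 2 3 5; do
    echo "nodes=$nodes quorum=$(quorum_needed "$nodes") tolerated_failures=$(failures_tolerated "$nodes")"
done
```

With two nodes the majority is both nodes, so no failure is tolerated. A third vote is the cheapest way to survive one.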
Storage note: ZFS on NVMe. Not spinning disk, not SATA SSD. NVMe for VM storage. ZFS mirror (RAID-1 equivalent) across two NVMe drives per node.
Installation
Install Proxmox VE on each node from the official ISO. During installation:
- Set a static IP for the management interface
- Use a separate disk/partition for the OS — don't put it on your ZFS pool
- Set the hostname to something meaningful: pve01.yourdomain.internal
After installation, create the ZFS pool:
# On each node - create mirrored ZFS pool from two NVMe drives
zpool create -f vmdata mirror /dev/nvme0n1 /dev/nvme1n1
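A slightly more defensive variant of the same step, as a sketch rather than the required procedure. It assumes you address the drives by their stable /dev/disk/by-id paths (the two IDs below are made-up placeholders), that ashift=12 suits your NVMe drives, and that you want the pool registered in Proxmox VE under the storage ID vmdata. It only prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 on a real Proxmox node once the values are correct.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

POOL="vmdata"
# Hypothetical device IDs; list yours with: ls -l /dev/disk/by-id/
DISK_A="/dev/disk/by-id/nvme-EXAMPLE_SERIAL_A"
DISK_B="/dev/disk/by-id/nvme-EXAMPLE_SERIAL_B"

setup_pool() {
    # ashift=12 (4 KiB sectors) is an assumption, not a value from this guide.
    run zpool create -o ashift=12 "$POOL" mirror "$DISK_A" "$DISK_B"
    # Confirm both drives show up as ONLINE in the mirror.
    run zpool status "$POOL"
    # Register the pool as VM disk and container storage in Proxmox VE.
    run pvesm add zfspool "$POOL" --pool "$POOL" --content images,rootdir
}

setup_pool
```

Storage definitions are shared across a Proxmox VE cluster, so once the nodes are clustered the pvesm step should only be needed once, provided every node has a pool with the same name.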
Clustering
From the first node, create the cluster:
pvecm create your-cluster-name
From each additional node, join:
pvecm add IP-OF-FIRST-NODE
Verify all nodes are visible in the web UI at https://NODE-IP:8006.
Network requirement: cluster communication (Corosync) needs low-latency, reliable connectivity between nodes. Use a dedicated network interface for this — not the same interface as VM traffic.
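One way to pin Corosync to the dedicated interface is to pass that interface's address as link 0 when creating and joining the cluster. The sketch below prints the commands unless DRY_RUN=0 is set. Every address in it is a placeholder for a hypothetical 10.10.10.0/24 Corosync network, not a value from this guide:

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

CLUSTER="your-cluster-name"

# Run on the first node. $1 = this node's address on the Corosync network.
create_cluster() {
    run pvecm create "$CLUSTER" --link0 "$1"
}

# Run on each additional node.
# $1 = IP of the first node, $2 = this node's address on the Corosync network.
join_cluster() {
    run pvecm add "$1" --link0 "$2"
}

# Run on any node afterwards: shows membership and whether the cluster is quorate.
check_cluster() {
    run pvecm status
}

# Example invocations; each one belongs on a different node.
create_cluster 10.10.10.1
join_cluster 10.10.10.1 10.10.10.2
check_cluster
```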
High Availability
HA requires the cluster to be able to fence (forcibly power off) nodes that become unresponsive. Without fencing, HA cannot safely restart VMs — it risks running the same VM on two nodes simultaneously (split-brain).
Configure IPMI/iLO/iDRAC on each node. In Proxmox:
- Datacenter → HA → Resources: add VMs that should be HA-protected
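- Datacenter → HA → Fencing: check the fencing setup. Proxmox VE fences through a per-node watchdog (softdog by default) rather than stored IPMI credentials; to use the BMC's hardware watchdog instead, set WATCHDOG_MODULE=ipmi_watchdog in /etc/default/pve-ha-manager and reboot the node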
Test this: power off a node ungracefully. HA VMs should restart on surviving nodes within 2–3 minutes, provided their disks are available there (shared storage, or ZFS replication to the target node).
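HA resources can also be registered from the shell with ha-manager, which is easier to repeat than clicking through the UI. A minimal sketch, assuming VM IDs 100 and 101 (hypothetical) and conservative restart and relocate limits of 1. It prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical VM IDs that should be HA-protected.
HA_VMIDS="100 101"

protect_vms() {
    for vmid in $HA_VMIDS; do
        # state=started asks HA to keep the VM running somewhere in the cluster.
        run ha-manager add "vm:$vmid" --state started --max_restart 1 --max_relocate 1
    done
    # Show the HA manager's view of nodes and resources.
    run ha-manager status
}

protect_vms
```

Running ha-manager status on a surviving node during the power-off test is a convenient way to watch the failed node get fenced and the resources recover.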
Networking with OPNsense
Deploy OPNsense as a VM on the cluster. This VM is your perimeter firewall and VLAN router.
Assign OPNsense two interfaces:
- WAN: connected to your upstream internet
- LAN/trunk: connected to a VLAN-capable bridge on the Proxmox nodes
Create VLANs for different traffic types:
- Management VLAN: Proxmox nodes, IPMI interfaces
- VM VLAN(s): your workload VMs
- Storage VLAN: if using shared storage across nodes
OPNsense handles routing between VLANs, NAT for internet access, and perimeter firewall rules. Your VMs never have direct internet access — all traffic goes through OPNsense.
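The VLAN-capable bridge can be sketched as a stanza for /etc/network/interfaces on each node, plus the qm calls that attach VMs to it. The bridge name vmbr1, the NIC name eno2, the VM IDs, and VLAN tag 20 are all assumptions for illustration. The script prints the stanza and, unless DRY_RUN=0 is set, only prints the qm commands:

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

BRIDGE="vmbr1"   # hypothetical bridge name
TRUNK_NIC="eno2" # hypothetical physical NIC connected to the managed switch

# Print a VLAN-aware bridge stanza for /etc/network/interfaces.
# $1 = bridge name, $2 = physical port.
vlan_bridge_stanza() {
    cat <<EOF
auto $1
iface $1 inet manual
    bridge-ports $2
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
EOF
}

# Give OPNsense an untagged port on the bridge so it sees all VLANs as a trunk.
# $1 = OPNsense VM ID.
attach_opnsense_trunk() {
    run qm set "$1" --net1 "virtio,bridge=$BRIDGE"
}

# Put a workload VM on a single VLAN. $1 = VM ID, $2 = VLAN tag.
attach_workload_vm() {
    run qm set "$1" --net0 "virtio,bridge=$BRIDGE,tag=$2"
}

vlan_bridge_stanza "$BRIDGE" "$TRUNK_NIC"
attach_opnsense_trunk 100
attach_workload_vm 101 20
```

Tagging at the bridge keeps VLAN configuration out of the workload VMs. Only OPNsense needs to know that more than one VLAN exists.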
Proxmox Backup Server
PBS is a separate application (not part of Proxmox VE) that handles VM-level backups efficiently.
Deploy PBS on a separate machine or VM. In Proxmox VE, add PBS as a storage backend:
- Datacenter → Storage → Add → Proxmox Backup Server
- Point at your PBS IP and authenticate
Configure backup jobs:
- Schedule: nightly at a low-traffic time
- Retention: 7 daily, 4 weekly, 3 monthly (adjust to your storage capacity)
- Encryption: enable. PBS encrypts backups client-side before transmission.
PBS deduplicates across backups. Similar VMs that share base images will deduplicate heavily — a realistic ratio is 3–5× storage efficiency versus raw backup size.
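The storage and retention settings above can be expressed on the command line roughly as follows. This is a sketch under assumptions: the storage ID pbs01, the server address, the datastore name, the user, and the fingerprint are placeholders, and credential handling is left out on purpose. It prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# All of these are placeholders, not values from this guide.
PBS_ID="pbs01"
PBS_SERVER="192.0.2.20"
PBS_DATASTORE="backups"
PBS_USER="backup@pbs"
PBS_FINGERPRINT="PBS-CERT-FINGERPRINT"

# Build a retention string. $1 = daily, $2 = weekly, $3 = monthly.
prune_opts() {
    echo "keep-daily=$1,keep-weekly=$2,keep-monthly=$3"
}

add_pbs_storage() {
    # --encryption-key autogen asks Proxmox VE to generate a client-side key.
    # Back that key up somewhere safe: without it the backups are unreadable.
    run pvesm add pbs "$PBS_ID" --server "$PBS_SERVER" --datastore "$PBS_DATASTORE" \
        --username "$PBS_USER" --fingerprint "$PBS_FINGERPRINT" --encryption-key autogen
}

backup_all() {
    run vzdump --all --storage "$PBS_ID" --mode snapshot \
        --prune-backups "$(prune_opts 7 4 3)"
}

add_pbs_storage
backup_all
```

For the nightly schedule itself, a backup job under Datacenter → Backup is the usual route. The vzdump call here is mainly useful for a first manual run to confirm the storage works.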
What breaks
Quorum loss: with a 2-node cluster, a single node failure loses quorum. The cluster configuration (/etc/pve) on the surviving node goes read-only and HA cannot restart VMs. Solution: add a third (lightweight) node or deploy a Corosync QDevice.
ZFS ARC memory pressure: ZFS uses available RAM for its ARC (cache). On systems with many VMs, this can compete with VM RAM allocation. Set zfs_arc_max (a ZFS module parameter, given in bytes) to limit ARC to a sensible value (e.g., 8GB on a 64GB system).
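Because zfs_arc_max takes a byte count, it is easy to get the number of zeros wrong. A small sketch that converts the 8GB example into the line you would place in /etc/modprobe.d/zfs.conf; the helper names are illustrative:

```shell
#!/bin/sh
# Convert GiB to bytes. $1 = size in GiB.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the module option line for /etc/modprobe.d/zfs.conf. $1 = ARC limit in GiB.
arc_modprobe_line() {
    echo "options zfs zfs_arc_max=$(gib_to_bytes "$1")"
}

arc_modprobe_line 8
# prints: options zfs zfs_arc_max=8589934592
```

The module option takes effect when the ZFS module is next loaded, typically after a reboot. On a running node the same byte value can be written to /sys/module/zfs/parameters/zfs_arc_max to apply the limit without waiting.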
IPMI connectivity: IPMI-based fencing and out-of-band power control fail if IPMI is unreachable from the cluster network. Keep IPMI on a separate management VLAN reachable from all nodes.
NTP: a Proxmox VE cluster needs synchronized time on all nodes. Install and configure chrony on every node. Cluster communication can fail without an obvious error once time skew exceeds about 1 second, so check the clocks before debugging anything else.
PBS backup failures: backups start failing once the PBS storage fills up, and nobody notices unless something is watching the jobs. Monitor PBS storage usage. Set up TOWER alerts on storage capacity.
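If no monitoring system is in place yet, even a cron-driven threshold check on the PBS host is better than nothing. A minimal sketch, assuming the datastore is a mounted filesystem and that 80% is an acceptable alert threshold. The path defaults to / here only so the script runs anywhere; point DATASTORE_PATH at your real datastore mount:

```shell
#!/bin/sh
# Replace with your datastore mount point (hypothetical: /mnt/datastore/backups).
DATASTORE_PATH="${DATASTORE_PATH:-/}"
LIMIT_PERCENT="${LIMIT_PERCENT:-80}"

# Print the used percentage (digits only) of the filesystem holding $1.
usage_percent() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Compare a usage figure against a limit. $1 = used percent, $2 = limit percent.
# Returns 0 when below the limit and 1 when at or above it.
check_usage() {
    if [ "$1" -ge "$2" ]; then
        echo "ALERT: $1% used (limit $2%)"
        return 1
    fi
    echo "OK: $1% used (limit $2%)"
    return 0
}

# Hook your own alerting (mail, webhook, TOWER) into the failure branch.
check_usage "$(usage_percent "$DATASTORE_PATH")" "$LIMIT_PERCENT" \
    || echo "datastore at or above threshold; send an alert here"
```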
Honest cost breakdown
Hardware (2-node cluster)
- 2× used enterprise servers (HPE DL360 Gen10 or similar): €2,000–4,000 total
- 4× NVMe drives (2 per node): €400–800
- Managed switch: €200–500
- Total hardware: €2,600–5,300
Ongoing
- Power: 2 servers at 150W average is about 216 kWh/month, or €50–70/month at roughly €0.23–0.32/kWh
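- Proxmox subscription (optional, for enterprise repo access): €95/year per CPU socket (Proxmox prices subscriptions per socket rather than per node, so a dual-socket server counts twice; check current pricing)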
- Maintenance: 2–4 hours/month
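The power figure is easy to redo for your own hardware and tariff. A small sketch of the arithmetic, assuming a 30-day month; the electricity prices used below are example values, not quotes:

```shell
#!/bin/sh
# Monthly energy in kWh. $1 = average watts per server, $2 = number of servers.
monthly_kwh() {
    LC_ALL=C awk -v w="$1" -v n="$2" 'BEGIN { printf "%.0f\n", w * n * 24 * 30 / 1000 }'
}

# Monthly cost. $1 = average watts per server, $2 = servers, $3 = price per kWh.
monthly_cost() {
    LC_ALL=C awk -v w="$1" -v n="$2" -v p="$3" 'BEGIN { printf "%.2f\n", w * n * 24 * 30 / 1000 * p }'
}

echo "energy: $(monthly_kwh 150 2) kWh/month"
echo "cost at 0.25/kWh: $(monthly_cost 150 2 0.25)"
echo "cost at 0.30/kWh: $(monthly_cost 150 2 0.30)"
```

Two servers at 150W come to 216 kWh a month, which is where a range of €50–70 comes from at typical European household or small-business rates.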
What you get
A platform that can run 20–40 VMs with HA, daily backups, and full network isolation. The same capability costs €3,000–8,000/month on AWS.
What you're trading
Hardware failures are your problem. A failed NVMe at 2am is your 2am. Without a monitoring system (TOWER equivalent), you may not know until a VM starts crashing.
Or let PILOT run it
If you'd rather have the cluster managed than spend your evenings replacing drives and debugging Corosync — that's the infrastructure service.