How to Set Up Proxmox Cluster — ZFS, HA, OPNsense, Backup Guide 2026

What you're building

A Proxmox VE cluster: two or three physical nodes running KVM virtual machines and LXC containers, with ZFS storage on every node, high availability, automated backup, and OPNsense handling perimeter security.

This is the foundation that everything else in a sovereign infrastructure stack runs on. Get this right and every other service becomes straightforward to deploy.

Hardware requirements

Minimum viable cluster: 2 nodes

2× servers, each with: 32GB+ RAM, 8+ cores, 2× NVMe SSDs (storage), 1× SSD (OS), 2× network interfaces
A managed switch with VLAN support
A separate, small server or VM for Proxmox Backup Server (or use one of the nodes)

Recommended for production: 3 nodes

Three nodes allow proper quorum. With two nodes, a single failure costs the cluster its quorum, and HA cannot automatically restart VMs. A third node (even a lightweight one) solves this.
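The quorum arithmetic behind this recommendation can be sketched in a few lines of shell. It assumes one vote per node (the Corosync default) and a strict-majority rule; the function names are made up for illustration.

```shell
# Sketch: quorum arithmetic, assuming one vote per node (the Corosync default).
# quorum_votes and tolerated_failures are illustrative helper names.

# Votes needed for quorum: a strict majority of all configured votes.
quorum_votes() {
  echo $(( $1 / 2 + 1 ))
}

# Node failures the cluster can absorb while the survivors still hold quorum.
tolerated_failures() {
  echo $(( $1 - $(quorum_votes "$1") ))
}

for nodes in 2 3; do
  echo "$nodes nodes: quorum needs $(quorum_votes "$nodes") votes," \
       "tolerates $(tolerated_failures "$nodes") failure(s)"
done
```

With two nodes the majority is 2, so losing either node leaves the survivor below quorum. With three nodes the majority is still 2, so one node can fail.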

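Storage note: ZFS on NVMe. Not spinning disk, not SATA SSD. NVMe for VM storage. ZFS mirror (RAID-1 equivalent) across two NVMe drives per node. Each pool is local to its node, so for HA to restart a VM elsewhere its disks must also exist there, either through Proxmox storage replication between identically named pools or on shared storage.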

Installation

Install Proxmox VE on each node from the official ISO. During installation:

Set a static IP for the management interface
Use a separate disk/partition for the OS — don't put it on your ZFS pool
Set the hostname to something meaningful: pve01.yourdomain.internal

After installation, create the ZFS pool:

# On each node: create a mirrored ZFS pool from two NVMe drives.
# -f skips the in-use safety checks, so confirm the device names with lsblk first.
zpool create -f vmdata mirror /dev/nvme0n1 /dev/nvme1n1
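A pool created with zpool is not usable for guests until it is registered as a Proxmox storage. A minimal sketch using pvesm follows; the storage ID simply reuses the pool name, and the content types are an assumption about what you plan to store. The commands are wrapped in an illustrative function so nothing runs until you call it.

```shell
# Sketch: check the pool, then register it with Proxmox as a "zfspool" storage.
# register_zfs_pool is an illustrative helper name, not a Proxmox command.
register_zfs_pool() {
  pool="$1"
  # Confirm the mirror is healthy before putting VM disks on it.
  zpool status "$pool" || return 1
  # images = VM disks, rootdir = container root filesystems.
  pvesm add zfspool "$pool" --pool "$pool" --content images,rootdir
}

# Usage, as root on one node:
#   register_zfs_pool vmdata
```

Storage definitions live in the cluster-wide configuration, so once the nodes are clustered a single entry is shared by all of them. Using the same pool name on every node keeps that entry valid everywhere.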

Clustering

From the first node, create the cluster:

pvecm create your-cluster-name

From each additional node, join:

pvecm add IP-OF-FIRST-NODE

Verify all nodes are visible in the web UI at https://NODE-IP:8006.
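The same check can be done from a shell. The sketch below assumes that pvecm status prints a "Quorate:" line answering Yes or No, and it skips the live call on machines without pvecm; is_quorate is an illustrative helper name.

```shell
# Sketch: succeed if "pvecm status"-style text on stdin reports a quorate cluster.
# Assumes the output contains a line of the form "Quorate:   Yes".
is_quorate() {
  grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Only query the cluster when pvecm exists (that is, on a Proxmox node).
if command -v pvecm >/dev/null 2>&1; then
  pvecm nodes    # membership list: every node should appear here
  if pvecm status | is_quorate; then
    echo "cluster is quorate"
  else
    echo "cluster is NOT quorate" >&2
  fi
fi
```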

Network requirement: cluster communication (Corosync) needs low-latency, reliable connectivity between nodes. Use a dedicated network interface for this — not the same interface as VM traffic.
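One way to pin Corosync to the dedicated interface is the --link0 option of pvecm, which takes the node's own address on the cluster network. A sketch follows, with the 10.10.10.x addresses as placeholders and the commands wrapped in illustrative functions so that nothing runs until called.

```shell
# Sketch: bind Corosync link 0 to each node's address on a dedicated network.
# Function names and the 10.10.10.x addresses are placeholders.

# First node: create the cluster on its Corosync-network address.
create_cluster() {
  pvecm create "$1" --link0 "$2"
}

# Additional node: join via the first node, advertising this node's own
# Corosync-network address for link 0.
join_cluster() {
  pvecm add "$1" --link0 "$2"
}

# Usage:
#   on pve01:  create_cluster your-cluster-name 10.10.10.1
#   on pve02:  join_cluster 10.10.10.1 10.10.10.2
```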

High Availability

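HA requires the cluster to be able to fence (forcibly reset or power off) nodes that become unresponsive. Without fencing, HA cannot safely restart VMs: it risks running the same VM on two nodes simultaneously (split-brain).

Proxmox VE fences with a watchdog on each node. A node that loses quorum stops resetting its watchdog and is rebooted, after which its HA VMs are recovered on the remaining nodes. The Linux softdog module is used by default; a hardware watchdog, such as the one exposed by many IPMI/iLO/iDRAC controllers, can be selected in /etc/default/pve-ha-manager. Still configure IPMI/iLO/iDRAC on each node for out-of-band power control and console access.

In Proxmox, add the VMs that should be HA-protected under Datacenter → HA → Resources. Fencing needs no credentials in the HA settings, because each node fences itself through its watchdog.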

Test this: power off a node ungracefully. HA VMs should restart on surviving nodes within 2–3 minutes.
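HA resources can also be managed from the command line with ha-manager. A sketch, assuming a VM with the hypothetical ID 100; protect_vm is an illustrative wrapper, so nothing runs until it is called.

```shell
# Sketch: put a VM under HA management and show the HA state.
# protect_vm is an illustrative name; 100 below is a hypothetical VM ID.
protect_vm() {
  vmid="$1"
  # Ask HA to keep the VM running, restarting it elsewhere if its node fails.
  ha-manager add "vm:${vmid}" --state started
  # List managed resources and the node currently running each one.
  ha-manager status
}

# Usage, as root on any cluster node:
#   protect_vm 100
```

Running ha-manager status before and after the power-off test shows the resource moving from the failed node to a survivor.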

Networking with OPNsense

Deploy OPNsense as a VM on the cluster. This VM is your perimeter firewall and VLAN router.

Assign OPNsense two interfaces:

WAN: connected to your upstream internet
LAN/trunk: connected to a VLAN-capable bridge on the Proxmox nodes

Create VLANs for different traffic types:

Management VLAN: Proxmox nodes, IPMI interfaces
VM VLAN(s): your workload VMs
Storage VLAN: if using shared storage across nodes

OPNsense handles routing between VLANs, NAT for internet access, and perimeter firewall rules. Your VMs never have direct internet access — all traffic goes through OPNsense.
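On the Proxmox side, the trunk is typically a VLAN-aware Linux bridge defined in /etc/network/interfaces. The sketch below prints such a stanza for review rather than editing the file; the bridge name vmbr1 and the NIC name eno2 are assumptions, so substitute your own.

```shell
# Sketch: print an /etc/network/interfaces stanza for a VLAN-aware bridge.
# vlan_bridge_stanza is an illustrative helper; vmbr1 and eno2 are assumed names.
vlan_bridge_stanza() {
  bridge="$1"
  port="$2"
  cat <<EOF
auto ${bridge}
iface ${bridge} inet manual
    bridge-ports ${port}
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094
EOF
}

vlan_bridge_stanza vmbr1 eno2
```

Attach the OPNsense LAN/trunk interface to this bridge without a VLAN tag so it receives all VLANs, and give each workload VM's network device the tag of its VLAN.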

Proxmox Backup Server

PBS is a separate application (not part of Proxmox VE) that handles VM-level backups efficiently.

Deploy PBS on a separate machine or VM. In Proxmox VE, add PBS as a storage backend:

Datacenter → Storage → Add → Proxmox Backup Server
Point at your PBS IP and authenticate

Configure backup jobs:

Schedule: nightly at a low-traffic time
Retention: 7 daily, 4 weekly, 3 monthly (adjust to your storage capacity)
Encryption: enable. PBS encrypts backups client-side before transmission. Keep a copy of the encryption key outside the cluster; without it the backups cannot be restored.
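Backup jobs are normally created in the web UI, but the retention above can also be expressed as a vzdump prune-backups value. The sketch builds that string and prints, without running it, an equivalent one-off command. The storage ID pbs is an assumption, and prune_spec is an illustrative helper.

```shell
# Sketch: build a prune-backups value from daily/weekly/monthly counts.
prune_spec() {
  echo "keep-daily=$1,keep-weekly=$2,keep-monthly=$3"
}

STORAGE_ID="pbs"   # assumed storage ID of the PBS backend in Proxmox VE

# Print (not run) a one-off backup of all guests with the retention from the text.
echo "vzdump --all 1 --storage ${STORAGE_ID} --mode snapshot --prune-backups $(prune_spec 7 4 3)"
```

Retention can alternatively be enforced on the PBS side with prune jobs; pick one place so the two policies do not fight each other.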

PBS deduplicates across backups. Similar VMs that share base images will deduplicate heavily — a realistic ratio is 3–5× storage efficiency versus raw backup size.

What breaks

Quorum loss: with a 2-node cluster, a single node failure loses quorum. The cluster configuration (/etc/pve) on the surviving node goes read-only and HA cannot restart VMs. Solution: add a third (lightweight) node or deploy a Corosync QDevice.

ZFS ARC memory pressure: ZFS uses available RAM for its ARC (cache). On systems with many VMs, this competes with VM RAM allocation. Set the zfs_arc_max module parameter, which takes a value in bytes, to cap the ARC at a sensible size (e.g., 8GB on a 64GB system).
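Because zfs_arc_max is given in bytes, it helps to compute the value rather than type it. A small sketch that treats the 8GB example as 8 GiB; arc_max_bytes is an illustrative helper name.

```shell
# Sketch: convert an ARC limit in GiB to the byte value zfs_arc_max expects.
arc_max_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the module option line for an 8 GiB cap (the example from the text).
echo "options zfs zfs_arc_max=$(arc_max_bytes 8)"
```

Placing the printed line in /etc/modprobe.d/zfs.conf applies the cap when the ZFS module loads. If the root filesystem is on ZFS, refresh the initramfs (update-initramfs -u -k all) before rebooting.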

IPMI connectivity: you cannot power-cycle or inspect a hung node remotely if its IPMI interface is unreachable. Keep IPMI on a separate management VLAN reachable from all nodes.

NTP: the cluster stack expects all nodes to have synchronized time. Recent Proxmox VE releases install chrony by default; confirm it is running and synchronized on every node. Clock skew between nodes causes failures that are hard to trace back to their cause.

PBS backup failures: scheduled backups start failing once the PBS datastore fills up, and nobody notices unless job results are being watched. Monitor PBS storage usage. Set up TOWER alerts on storage capacity.
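Where no monitoring system is in place yet, a capacity check can be as small as the sketch below. DATASTORE_PATH and THRESHOLD_PCT are assumptions: the path defaults to / only so the script runs anywhere, and should point at the filesystem that holds your PBS datastore.

```shell
# Sketch: warn when the filesystem holding the PBS datastore passes a threshold.
# DATASTORE_PATH and THRESHOLD_PCT are assumptions; point the path at your datastore.
DATASTORE_PATH="${DATASTORE_PATH:-/}"
THRESHOLD_PCT="${THRESHOLD_PCT:-80}"

# Print the used-space percentage (integer, no % sign) of the filesystem at $1.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

used="$(usage_pct "$DATASTORE_PATH")"
if [ "$used" -ge "$THRESHOLD_PCT" ]; then
  echo "WARNING: ${DATASTORE_PATH} is ${used}% full (threshold ${THRESHOLD_PCT}%)" >&2
else
  echo "OK: ${DATASTORE_PATH} is ${used}% full"
fi
```

Run it from cron or a systemd timer on the PBS host and route the warning into whatever alerting you already use.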

Honest cost breakdown

Hardware (2-node cluster)

2× used enterprise servers (HPE DL360 Gen10 or similar): €2,000–4,000 total
4× NVMe drives (2 per node): €400–800
Managed switch: €200–500
Total hardware: €2,600–5,300

Ongoing

Power: 2 servers at 150W average = €50–70/month
Proxmox subscription (optional, for enterprise repo access): €95/year per CPU socket at the entry tier (pricing is per socket, not per node; check the current price list)
Maintenance: 2–4 hours/month
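The power figure can be reproduced with simple arithmetic. The sketch below assumes a 30-day month and electricity prices of 25 and 30 euro cents per kWh; both prices are assumptions, so substitute your own tariff.

```shell
# Sketch: monthly energy and cost for the two-node example.
# The electricity prices are assumptions; substitute your own tariff.
SERVERS=2
WATTS_EACH=150
HOURS_PER_MONTH=$(( 24 * 30 ))

KWH_PER_MONTH=$(( SERVERS * WATTS_EACH * HOURS_PER_MONTH / 1000 ))

# Monthly cost in euros for a price given in euro cents per kWh.
monthly_cost() {
  cents=$(( KWH_PER_MONTH * $1 ))
  printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

echo "Energy: ${KWH_PER_MONTH} kWh/month"
for price in 25 30; do
  echo "At ${price} ct/kWh: EUR $(monthly_cost "$price") per month"
done
```

Two servers at 150W draw 216 kWh over 30 days, which lands inside the €50–70 range at these prices.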

What you get

A platform that can run 20–40 VMs with HA, daily backups, and full network isolation. The same capability costs €3,000–8,000/month on AWS.

What you're trading

Hardware failures are your problem. A failed NVMe at 2am is your 2am. Without a monitoring system (TOWER equivalent), you may not know until a VM starts crashing.

Or let PILOT run it

If you'd rather have the cluster managed than spend your evenings replacing drives and debugging Corosync — that's the infrastructure service.
