Covers rebuilding and rejoining a failed or replaced cluster node. Assumes the other node is healthy and running the full workload. Pacemaker service files (ha-*.service) do not need to be rebuilt — they are copied from the surviving node.
Before starting, set these values:
- Surviving node: thing-1 ← change me
- Node being replaced: thing-2 ← change me
- Surviving node IP: 192.168.1.21 ← change me
- Replaced node IP: 192.168.1.22 ← change me
- jgmelon IP (private registry): 192.168.1.X ← change me
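To avoid hand-editing every command below, the placeholders can be set as shell variables up front. A minimal sketch — the variable names are illustrative and are not referenced elsewhere in this runbook:

```shell
# Illustrative variables matching the placeholders above; substitute real values
SURVIVOR=thing-1           # surviving node
REPLACED=thing-2           # node being replaced
SURVIVOR_IP=192.168.1.21   # surviving node IP
REPLACED_IP=192.168.1.22   # replaced node IP
echo "rebuilding $REPLACED ($REPLACED_IP); survivor is $SURVIVOR ($SURVIVOR_IP)"
```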
On the surviving node, confirm cluster is healthy and running solo:
sudo pcs status
sudo drbdadm status navidrome
Verify all services are running on thing-1. thing-2 should be Stopped or absent.
Boot from AlmaLinux 9 ISO. Use Custom Partitioning — do NOT use automatic.
| Mount | Size | Type |
|---|---|---|
| /boot/efi | 600MB | FAT32 |
| /boot | 1GB | XFS |
| (LVM PV) | remaining | LVM |
LVM Volume Group name: almalinux
| LV | Size |
|---|---|
| root (/) | 50GB |
| var (/var) | 200GB |
| home (/home) | 20GB |
| swap | 16GB |
Leave remaining space unallocated — allocate later with lvextend as needed.
When partitioning, ensure a GPT partition table is selected. MBR has a 2TB limit and, if the partition table is laid out wrong, can waste the first 1TB of a 1.9TB disk.
# If the disk has unpartitioned space, grow partition 3 and the PV:
sudo parted /dev/sda resizepart 3 100%
sudo pvresize /dev/sda3
sudo pvs # verify PFree shows full available space
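Once PFree shows space, individual LVs can be grown online with lvextend as mentioned above. A hypothetical example growing /var by 50G (the size and target LV are assumptions); it's printed as a dry run here so nothing is resized by accident:

```shell
# Hypothetical later growth of /var; set DRY_RUN=0 on the node to actually run it.
# --resizefs grows the XFS filesystem along with the LV.
DRY_RUN=1
cmd="lvextend -L +50G --resizefs /dev/almalinux/var"
if [ "$DRY_RUN" = 1 ]; then
  echo "would run: sudo $cmd"
else
  sudo $cmd
fi
```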
ip link show
# enp2s0f0 = 10GbE external (main network)
# enp2s0f1 = 10GbE interconnect (DRBD + cluster heartbeat)
Verify with ethtool enp2s0f0 | grep Speed — should show 10000Mb/s.
sudo nmcli connection add type ethernet \
  con-name 10g-external \
  ifname enp2s0f0 \
  ipv4.addresses 192.168.1.22/24 \
  ipv4.gateway 192.168.1.1 \
  ipv4.dns "192.168.1.1" \
  ipv4.method manual \
  connection.autoconnect yes
sudo nmcli connection up 10g-external
sudo nmcli connection add type ethernet \
  con-name 10g-interconnect \
  ifname enp2s0f1 \
  ipv4.addresses 10.0.0.22/24 \
  ipv4.method manual \
  802-3-ethernet.mtu 9000 \
  connection.autoconnect yes
sudo nmcli connection up 10g-interconnect
ip -br addr show
ping -c 2 192.168.1.1 # gateway
ping -c 2 192.168.1.21 # thing-1
ping -c 2 10.0.0.21 # thing-1 interconnect
On thing-1:
iperf3 -s
On thing-2:
iperf3 -c 10.0.0.21
# Expect ~9.9 Gbits/sec
sudo hostnamectl set-hostname thing-2
Before filling in /etc/hosts, grab the jgmelon IP from thing-1:
grep jgmelon /etc/hosts   # run on thing-1
sudo tee /etc/hosts << 'EOF'
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.21 thing-1 thing-1.customstack.nyc
192.168.1.22 thing-2 thing-2.customstack.nyc
192.168.1.246 synology
192.168.1.X jgmelon # private Docker registry host — get IP from thing-1
EOF
Verify the FQDN — this must match the on clause in /etc/drbd.d/navidrome.res:
uname -n
# Must return: thing-2.customstack.nyc
⚠️ Critical: The hostname returned by uname -n must match the on clause in /etc/drbd.d/navidrome.res exactly. Verify before proceeding to DRBD setup.
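The hostname check can be scripted. This self-contained sketch builds a temporary stand-in for the resource file so it runs anywhere; on the node, point res= at /etc/drbd.d/navidrome.res instead:

```shell
# Verify that uname -n appears in an 'on' clause of the DRBD resource file.
res=$(mktemp)
printf 'on %s {\n}\n' "$(uname -n)" > "$res"   # stand-in; use /etc/drbd.d/navidrome.res on the node
if grep -q "on $(uname -n)" "$res"; then
  match=yes
  echo "hostname matches an on-clause"
else
  match=no
  echo "MISMATCH: $(uname -n) not found in $res"
fi
rm -f "$res"
```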
On thing-2:
ssh-keygen -t ed25519 -C "guernica@thing-2"
ssh-copy-id guernica@thing-1
sudo ssh-keygen -t ed25519 -C "root@thing-2"
sudo ssh-copy-id root@thing-1
On thing-1, copy keys back to thing-2:
ssh-copy-id guernica@thing-2
sudo ssh-copy-id root@thing-2
Test both directions:
ssh thing-1 hostname
ssh thing-2 hostname
# EPEL and ELRepo for DRBD
sudo dnf install -y epel-release
sudo dnf install -y https://www.elrepo.org/elrepo-release-9.el9.elrepo.noarch.rpm
# Core packages
sudo dnf install -y \
  pcs pacemaker corosync fence-agents-all \
  nfs-utils \
  drbd-utils drbd-selinux \
  container-selinux \
  sqlite \
  iperf3
# Docker
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo systemctl enable --now docker
sudo usermod -aG docker guernica
ls -Z /usr/bin/dockerd
# Must show: container_runtime_exec_t
# If it shows bin_t, run: sudo restorecon -v /usr/bin/dockerd
⚠️ Do this before attempting to join the cluster. pcs host auth will fail with a confusing error if the hacluster password isn't set first.
sudo passwd hacluster
# Use the same password as on thing-1
sudo systemctl enable --now pcsd
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json << 'EOF'
{
"insecure-registries": ["jgmelon:5002"]
}
EOF
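Before restarting Docker, it's worth confirming the file parses. A self-contained sketch validating a temp copy with python3 (shipped with AlmaLinux 9); on the node, validate /etc/docker/daemon.json directly:

```shell
# Validate JSON syntax; swap "$f" for /etc/docker/daemon.json on the node
f=$(mktemp)
cat > "$f" << 'EOF'
{
  "insecure-registries": ["jgmelon:5002"]
}
EOF
if python3 -m json.tool "$f" > /dev/null; then
  valid=yes
  echo "daemon.json parses as valid JSON"
fi
rm -f "$f"
```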
sudo systemctl restart docker
sudo docker info | grep -A 3 "Insecure Registries"
sudo docker network create webnet
sudo docker network ls | grep webnet
sudo mkdir -p /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
sudo systemctl restart containerd
sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --permanent --add-service=imap
sudo firewall-cmd --permanent --add-service=imaps
sudo firewall-cmd --permanent --add-service=smtp
sudo firewall-cmd --permanent --add-service=smtps
sudo firewall-cmd --permanent --add-port=587/tcp
sudo firewall-cmd --permanent --add-port=4190/tcp
sudo firewall-cmd --permanent --add-port=5201/tcp
sudo firewall-cmd --permanent --add-port=27018/tcp
sudo firewall-cmd --permanent --add-port=4533/tcp
sudo firewall-cmd --permanent --add-port=7789/tcp # DRBD
sudo firewall-cmd --reload
sudo firewall-cmd --list-all
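To double-check that no port was missed, the expected list can be compared against the live one. A sketch — on the node, set listed from `sudo firewall-cmd --list-ports` instead of the sample value used here so the snippet runs standalone:

```shell
# Compare expected ports against the firewall's reported list
needed="587/tcp 4190/tcp 5201/tcp 27018/tcp 4533/tcp 7789/tcp"
# On the node: listed=$(sudo firewall-cmd --list-ports)
listed="$needed"   # sample value so the sketch runs standalone
missing=0
for p in $needed; do
  case " $listed " in
    *" $p "*) ;;                        # present
    *) echo "missing: $p"; missing=1 ;;
  esac
done
[ "$missing" = 0 ] && echo "all expected ports present"
```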
sudo mkdir -p /mnt/ha-shared
sudo mkdir -p /senuti
sudo mkdir -p /var/lib/navidrome
# DO NOT add these to /etc/fstab — Pacemaker manages them
Before handing off to Pacemaker, verify NFS works standalone:
showmount -e 192.168.1.246
# Should list the exports
sudo mount -t nfs 192.168.1.246:/volume1/ha-shared /mnt/ha-shared
ls /mnt/ha-shared
# Verify contents are visible, then unmount — Pacemaker will remount
sudo umount /mnt/ha-shared
⚠️ Don't skip this. If NFS is broken, Pacemaker will fail to mount and the resource group won't start — and the error won't always make the NFS problem obvious.
echo -e "drbd\ndrbd_transport_tcp" | sudo tee /etc/modules-load.d/drbd.conf
sudo modprobe drbd
sudo modprobe drbd_transport_tcp
lsmod | grep drbd
sudo lvcreate -L 20G -n navidrome almalinux
sudo lvcreate -L 1G -n navidrome-meta almalinux
lsblk | grep navidrome
sudo scp thing-1:/etc/drbd.d/navidrome.res /etc/drbd.d/navidrome.res
cat /etc/drbd.d/navidrome.res
Verify the on clauses use FQDNs matching uname -n on each node:
on thing-1.customstack.nyc { ... }
on thing-2.customstack.nyc { ... }
sudo drbdadm create-md navidrome
sudo drbdadm up navidrome
sudo drbdadm status navidrome
# Should show: Connecting or SyncTarget — thing-1 will push data
sudo scp thing-1:/etc/systemd/system/ha-*.service /etc/systemd/system/
sudo systemctl daemon-reload
ls /etc/systemd/system/ha-*.service
sudo systemctl cat ha-navidrome
sudo systemctl cat ha-web-containers
sudo systemctl cat ha-nginx-proxy
# Each should show WorkingDirectory pointing to /mnt/ha-shared/...
From thing-1:
# Authenticate the new node
sudo pcs host auth thing-2 -u hacluster
# Sync cluster config to thing-2
sudo pcs cluster sync
# Start cluster on thing-2
sudo pcs cluster start thing-2
sudo pcs cluster enable thing-2
On thing-2, verify both nodes are online:
sudo pcs status
sudo pcs status corosync
# Both nodes should show Online
sudo pcs node unstandby thing-2
sudo pcs status
sudo pcs resource cleanup
sudo pcs status
# Always run this after rejoining — Pacemaker won't retry until failure history is cleared
watch sudo drbdadm status navidrome
# Wait for: disk:UpToDate on both nodes
# Syncing 20GB over 10GbE takes a few minutes
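Instead of watching by hand, a small poll loop can block until the local disk reports UpToDate. The helper takes the status command as arguments so it can be demonstrated with a stub; on the node, the real call would be `wait_sync sudo drbdadm status navidrome`:

```shell
# Poll the given status command until its output contains disk:UpToDate
wait_sync() {
  while ! "$@" | grep -q 'disk:UpToDate'; do
    sleep 5
  done
  echo "DRBD disk is UpToDate"
}

# Stub run demonstrating the exit condition
# (real usage: wait_sync sudo drbdadm status navidrome)
out=$(wait_sync echo 'navidrome role:Secondary disk:UpToDate')
echo "$out"
```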
Do this after NFS is mounted and DRBD is synced. Pre-pulling prevents Pacemaker start timeouts on first failover to this node.
cd /mnt/ha-shared/web-containers && sudo docker compose pull
cd /mnt/ha-shared/nginx-proxy && sudo docker compose pull
cd /mnt/ha-shared/navidrome && sudo docker compose pull
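The three pulls can be rolled into one loop over the stack directories. The docker command is commented out so the sketch runs without Docker present; uncomment it on the node:

```shell
# Pre-pull images for each stack directory under /mnt/ha-shared
pulled=""
for d in web-containers nginx-proxy navidrome; do
  echo "pulling images for $d"
  # (cd "/mnt/ha-shared/$d" && sudo docker compose pull)
  pulled="$pulled $d"
done
```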
# Cluster healthy
sudo pcs status
# DRBD synced
sudo drbdadm status navidrome
# Docker working
sudo docker ps
sudo docker network ls | grep webnet
sudo docker info | grep -A 3 "Insecure Registries"
# NFS reachable
showmount -e 192.168.1.246
# SELinux
sudo journalctl -t setroubleshoot --no-pager | tail -5
Once thing-2 is healthy, test a failover TO it:
# Put thing-1 in standby
sudo pcs node standby thing-1
# Watch everything migrate
watch sudo pcs status
# Verify site is up
curl -I https://ipodrepair.nyc
# Fail back
sudo pcs node unstandby thing-1
| Symptom | Fix |
|---|---|
| DRBD resource not found in configuration file | sudo modprobe drbd && sudo modprobe drbd_transport_tcp; also verify uname -n matches the on clause in navidrome.res exactly |
| network webnet declared as external, but could not be found | sudo docker network create webnet |
| permission denied on Docker socket even with sudo | container-selinux not installed or dockerd has the wrong label: sudo dnf install -y container-selinux && sudo restorecon -v /usr/bin/dockerd && sudo systemctl restart docker |
| failed to create temp dir: stat .../snapshots: no such file or directory | sudo mkdir -p /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots && sudo systemctl restart containerd |
| Resources fail on new node, succeed manually | sudo pcs resource cleanup after fixing issues manually |
| pcsd not starting on boot | sudo systemctl enable pcsd |
| Node | Interface | IP | Purpose |
|---|---|---|---|
| thing-1 | enp2s0f0 | 192.168.1.21 | Main network |
| thing-1 | enp2s0f1 | 10.0.0.21 | DRBD / heartbeat |
| thing-2 | enp2s0f0 | 192.168.1.22 | Main network |
| thing-2 | enp2s0f1 | 10.0.0.22 | DRBD / heartbeat |
| Service | Port |
|---|---|
| DRBD | 7789/tcp |
| pcsd | 2224/tcp |
| Corosync | 5403/udp |
| Navidrome | 4533/tcp |
| MongoDB | 27018/tcp |