Edit the SSH daemon config (/etc/ssh/sshd_config):
sudo vim /etc/ssh/sshd_config
PermitRootLogin prohibit-password
PubkeyAuthentication yes
PasswordAuthentication no
Apply changes:
sudo sshd -t
sudo systemctl reload sshd
Edit the SSH client config (/etc/ssh/ssh_config):
sudo vim /etc/ssh/ssh_config
PasswordAuthentication no
PubkeyAuthentication yes
# Generate key
ssh-keygen -t ed25519 -C "guernica@$(hostname)"
# Copy to remote
ssh-copy-id user@remote-host
# For root
sudo ssh-keygen -t ed25519 -C "root@$(hostname)"
sudo ssh-copy-id root@remote-host
ssh user@remote-host hostname
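Beyond confirming that key login works, it is worth confirming that password login is now refused. A minimal sketch (the host argument is a placeholder, not a host from this cluster):

```shell
# Negative check: once PasswordAuthentication no is active on the server,
# a forced password attempt should be refused. BatchMode suppresses any
# interactive prompt so the attempt fails fast instead of hanging; the
# leading ! inverts the result, so "refused" counts as success.
no_password_login() {
    ! ssh -o BatchMode=yes -o PreferredAuthentications=password \
          -o PubkeyAuthentication=no "$1" exit 2>/dev/null
}
# no_password_login user@remote-host && echo "password auth refused as expected"
```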
# Add Docker repository
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# Install Docker
sudo dnf install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
# Start and enable Docker
sudo systemctl enable --now docker
# Verify installation
docker --version
sudo systemctl status docker
# Add user to docker group (optional, for non-sudo access)
sudo usermod -aG docker guernica
# Log out and back in for the group change to take effect
# Test Docker
sudo docker run hello-world
For private registry access (both nodes):
# Create or edit Docker daemon config
sudo mkdir -p /etc/docker
sudo vim /etc/docker/daemon.json
Add:
{
"insecure-registries": ["jgmelon:5002"]
}
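A syntax error in daemon.json prevents dockerd from starting at all, so it pays to validate the file before restarting. A sketch using python3's stdlib JSON parser purely as a syntax checker:

```shell
# Validate daemon.json before touching the docker service; a malformed
# file would leave the daemon unable to start.
validate_daemon_json() {
    python3 -m json.tool "$1" >/dev/null
}
# validate_daemon_json /etc/docker/daemon.json && sudo systemctl restart docker
```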
Apply:
# Restart Docker
sudo systemctl restart docker
# Verify config is loaded
sudo docker info | grep -A 5 "Insecure Registries"
Ensure registry hostname resolves:
# Check /etc/hosts
grep jgmelon /etc/hosts
# If missing, add it
echo "192.168.1.X jgmelon" | sudo tee -a /etc/hosts
Create required networks (both nodes):
# Create webnet - required by web-containers, mongodb, navidrome
sudo docker network create webnet
# Verify
sudo docker network ls | grep webnet
Important: The webnet network must exist before starting HA services. This is a one-time operation per node - the network persists across reboots.
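Because the creation is one-time but easy to forget on a rebuilt node, an idempotent guard is handy in provisioning scripts. A sketch (the extra guards just make it safe to re-run anywhere):

```shell
# Idempotent webnet creation: inspect succeeds only if the network
# already exists; otherwise create it. Skips quietly on hosts where
# docker is absent or the daemon is unreachable.
ensure_webnet() {
    command -v docker >/dev/null 2>&1 || return 0   # no docker here
    docker info >/dev/null 2>&1 || return 0          # daemon unreachable
    docker network inspect webnet >/dev/null 2>&1 \
        || docker network create webnet
}
# ensure_webnet   # run with root privileges on each node
```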
# Check that HA services can find Docker
sudo systemctl cat ha-web-containers | grep ExecStart
sudo systemctl cat ha-mongodb | grep ExecStart
Both nodes must have container-selinux installed. Without it, dockerd will have the wrong SELinux label (bin_t instead of container_runtime_exec_t) and will fail to start or refuse socket connections even with correct permissions.
# Install container SELinux policy (both nodes)
sudo dnf install -y container-selinux
# Verify dockerd has correct label
ls -Z /usr/bin/dockerd
# Should show: system_u:object_r:container_runtime_exec_t:s0
# If label is wrong (bin_t or unlabeled_t), fix it:
sudo restorecon -v /usr/bin/dockerd
# Restart docker after fix
sudo systemctl restart docker
Symptoms of missing container-selinux:
- "permission denied while trying to connect to the Docker daemon socket" even with sudo
- docker compose fails with socket permission errors despite correct group membership
- dockerd fails to start after restart
- ls -Z /usr/bin/dockerd shows bin_t instead of container_runtime_exec_t

Root cause: running restorecon on a node without container-selinux installed labels dockerd as generic bin_t, since the container policy type doesn't exist yet. Always install container-selinux before running restorecon on Docker binaries.
On fresh nodes, the containerd overlay snapshotter directory may not exist, causing image build failures:
# Error: failed to create temp dir: stat /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots: no such file or directory
# Fix:
sudo mkdir -p /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
sudo systemctl restart containerd
Pre-flight checklist before installing the cluster stack:
- Registry hostname resolvable via /etc/hosts or DNS
- container-selinux installed on both nodes
- webnet network created on both nodes

# Enable HA repository
sudo dnf config-manager --set-enabled highavailability
# Install packages
sudo dnf install -y pcs pacemaker corosync fence-agents-all
# Install NFS client tools
sudo dnf install -y nfs-utils
# Start pcsd
sudo systemctl enable --now pcsd
# Set hacluster password (same on both nodes)
sudo passwd hacluster
# Add high-availability service
sudo firewall-cmd --permanent --add-service=high-availability
# Add mail services (for mailcow)
sudo firewall-cmd --permanent --add-service=http
sudo firewall-cmd --permanent --add-service=https
sudo firewall-cmd --permanent --add-service=imap
sudo firewall-cmd --permanent --add-service=imaps
sudo firewall-cmd --permanent --add-service=pop3
sudo firewall-cmd --permanent --add-service=pop3s
sudo firewall-cmd --permanent --add-service=smtp
sudo firewall-cmd --permanent --add-service=smtps
# Add application ports
sudo firewall-cmd --permanent --add-port=5201/tcp # iperf3
sudo firewall-cmd --permanent --add-port=27018/tcp # mongodb
sudo firewall-cmd --permanent --add-port=587/tcp # smtp submission
sudo firewall-cmd --permanent --add-port=4190/tcp # sieve
sudo firewall-cmd --permanent --add-port=3000/tcp # navidrome
# Add trusted source network (adjust to your needs)
sudo firewall-cmd --permanent --add-source=192.168.10.0/24
# Reload firewall
sudo firewall-cmd --reload
# Verify
sudo firewall-cmd --list-all
# List all listening ports
sudo ss -tulpn
# Check specific cluster ports
sudo ss -tulpn | grep -E "2224|3121|5403|21064"
# List all firewall rules
sudo firewall-cmd --list-all
# Check if specific service is allowed
sudo firewall-cmd --list-services
# Check open ports
sudo firewall-cmd --list-ports
# Create mount points for OCF Filesystem resources
sudo mkdir -p /mnt/ha-shared
sudo mkdir -p /senuti
# DO NOT add to /etc/fstab - Pacemaker manages these mounts
# Authenticate nodes
sudo pcs host auth thing-1 thing-2 -u hacluster
# Create cluster
sudo pcs cluster setup mycluster thing-1 thing-2
# Start cluster
sudo pcs cluster start --all
sudo pcs cluster enable --all
# Check status
sudo pcs status
# Disable STONITH (for testing, enable in production)
sudo pcs property set stonith-enabled=false
# Set quorum policy (for 2-node cluster)
sudo pcs property set no-quorum-policy=ignore
# Create virtual IP resource
sudo pcs resource create cluster_vip ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 cidr_netmask=24 \
op monitor interval=30s
# Check resource status
sudo pcs resource status
What it does: provides a floating IP (192.168.1.100) that always lives on the currently active node, giving clients a single stable address across failovers.
How it works:
# Pacemaker manages automatically:
# Start: ip addr add 192.168.1.100/24 dev enp2s0f0
# Monitor: Check if IP is assigned
# Stop: ip addr del 192.168.1.100/24 dev enp2s0f0
Configuration needed: None on nodes - Pacemaker handles it
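A quick way to see which node currently holds the VIP (sketch; the address comes from the example above):

```shell
# Check whether the given VIP is assigned to any interface on this node.
holds_vip() {
    ip -4 addr show 2>/dev/null | grep -q "inet $1/" \
        && echo "VIP held on this node" \
        || echo "VIP not on this node"
}
# holds_vip 192.168.1.100
```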
What it does: mounts the shared NFS export on whichever node is active, so HA services see the same data after a failover.
How it works:
# Pacemaker manages automatically:
# Start: mount -t nfs 192.168.1.246:/volume1/ha-cluster /mnt/ha-shared
# Monitor: Check if mounted
# Stop: umount /mnt/ha-shared
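The runbook doesn't show the creation command for this resource; it would look roughly like the following sketch (resource name, options, and the mapping of export to mount point are assumptions based on the mounts above):

```shell
# Sketch: create the NFS Filesystem resource that Pacemaker will
# mount/unmount on the active node. Names and options are assumed.
sudo pcs resource create ha-shared-fs ocf:heartbeat:Filesystem \
    device="192.168.1.246:/volume1/ha-cluster" directory="/mnt/ha-shared" \
    fstype="nfs" op monitor interval=30s
```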
Configuration needed:
- Mount points /mnt/ha-shared and /senuti created on both nodes
- nfs-utils installed
- Nothing in /etc/fstab - Pacemaker manages the mounts

For resources with long startup times (web-containers):
# Increase start timeout to 300s (matches systemd TimeoutStartSec)
sudo pcs resource update ha-web-containers op start timeout=300s
# Verify
sudo pcs resource config ha-web-containers
Default timeouts:
- start timeout=100s - may be too short for Docker builds/pulls
- stop timeout=100s - usually sufficient
- monitor interval=30s - health checks every 30 seconds

On first node setup, manually build/pull to avoid timeout:
# Navigate to compose directory
cd /mnt/ha-shared/web-containers
# Pull public images
sudo docker compose pull
# Build custom images (slow on first run)
sudo docker compose build
# Start containers manually
sudo docker compose up -d
# Verify running
sudo docker ps
# Now let Pacemaker manage it
sudo pcs resource cleanup ha-web-containers
Why this is needed: the first build/pull can take far longer than the resource start timeout, so a Pacemaker-triggered start would be killed and marked failed mid-build. Once images are cached locally, later starts complete well within the timeout.
# Cluster status
sudo pcs status
# Node status
sudo pcs status nodes
# Resource status
sudo pcs resource status
# Stop cluster
sudo pcs cluster stop --all
# Start cluster
sudo pcs cluster start --all
# Move resource to specific node
sudo pcs resource move cluster_vip thing-1
# Clear resource constraints
sudo pcs resource clear cluster_vip
# Put node in standby
sudo pcs node standby thing-1
# Unstandby node
sudo pcs node unstandby thing-1
# Cleanup failed resources
sudo pcs resource cleanup
sudo pcs resource cleanup ha-web-containers
# View resource constraints
sudo pcs constraint
# Manual failover to thing-1
sudo pcs node standby thing-2
# Watch resources migrate to thing-1
watch sudo pcs status
# Verify services accessible on VIP
curl -I http://192.168.1.100
# Failback to thing-2
sudo pcs node unstandby thing-2
sudo pcs node standby thing-1
# Or let both be online (resources stay put unless node fails)
sudo pcs node unstandby thing-2
sudo pcs node unstandby thing-1
When a node is rebuilt and needs to rejoin the cluster:
# From the working node (thing-2), authenticate new node
sudo pcs host auth thing-1 -u hacluster
# Sync cluster config to rebuilt node
sudo pcs cluster sync
# Start cluster on rebuilt node
sudo pcs cluster start thing-1
sudo pcs cluster enable thing-1
# Unstandby the node
sudo pcs node unstandby thing-1
# Verify both nodes online
sudo pcs status nodes
# View logs
sudo journalctl -u pacemaker -f
sudo journalctl -u corosync -f
sudo journalctl -u ha-web-containers -n 100
# Check cluster configuration
sudo pcs config
# Verify corosync membership
sudo pcs status corosync
# Clear failed resources
sudo pcs resource cleanup
# Check listening ports (cluster uses 2224, 3121, 5403)
sudo ss -tulpn | grep -E "2224|3121|5403|21064"
# Check Docker containers
sudo docker ps
sudo docker network ls
# Test manual service start
sudo systemctl start ha-web-containers
sudo systemctl status ha-web-containers
On server node:
# Install iperf3
sudo dnf install -y iperf3
# Start server
iperf3 -s
On client node:
# Install iperf3
sudo dnf install -y iperf3
# Test to server's 10GbE address
iperf3 -c 10.0.0.22
# Test main network
iperf3 -c 192.168.1.22
# Test reverse direction
iperf3 -c 10.0.0.22 -R
# Test parallel streams
iperf3 -c 10.0.0.22 -P 4
Expected 10GbE results: around 9.4 Gbits/sec for a single TCP stream (line rate minus protocol overhead); parallel streams should sum to roughly the same total.
# Check available NFS shares
showmount -e 192.168.1.246
# Test mount
sudo mkdir -p /mnt/test
sudo mount -t nfs 192.168.1.246:/volume1/ha-cluster /mnt/test
# Verify mount
df -h | grep ha-cluster
# Write speed test
time dd if=/dev/zero of=/mnt/test/speedtest bs=1M count=5000
# Clean up
rm /mnt/test/speedtest
sudo umount /mnt/test
/boot/efi - 600 MiB (FAT32)
/boot - 1024 MiB (XFS or ext4)
LVM Volume Group: almalinux (~1.99TB total)
Allocated:
├─ / - 50 GB (XFS)
├─ /var - 200 GB (XFS)
├─ /home - 10 GB (XFS)
└─ swap - 16 GB
Unallocated: ~1.7TB reserved for future growth
If partition is too small initially:
# Check current partition layout
sudo parted /dev/sda print
# Extend partition to use all space
sudo parted /dev/sda resizepart 3 100%
# Resize Physical Volume
sudo pvresize /dev/sda3
# Verify free space
sudo pvs
sudo vgs
Extending Logical Volumes:
# Extend /var by 100GB
sudo lvextend -L +100G /dev/almalinux/var
# Grow filesystem (XFS)
sudo xfs_growfs /var
# Or for ext4
sudo resize2fs /dev/almalinux/var
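The two steps can also be combined: lvextend's -r/--resizefs flag grows the filesystem in the same operation, which avoids forgetting the xfs_growfs/resize2fs step.

```shell
# Extend the LV and resize its filesystem in one step (-r detects
# XFS vs ext4 automatically).
sudo lvextend -r -L +100G /dev/almalinux/var
```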
# Check Physical Volumes
sudo pvs
# Check Volume Groups
sudo vgs
# Check Logical Volumes
sudo lvs
# Check filesystem usage
df -h
# Block device overview
lsblk
# Show all connections
nmcli connection show
# Show active connections
nmcli connection show --active
# Activate connection
sudo nmcli connection up 10g-external
# Reload connections
sudo nmcli connection reload
# Restart NetworkManager
sudo systemctl restart NetworkManager
# Delete duplicate connection
sudo nmcli connection delete <UUID>
# Rename connection
sudo nmcli connection modify <old-name> connection.id <new-name>
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.21 thing-1
192.168.1.22 thing-2
192.168.2.21 thing-1-cluster
192.168.2.22 thing-2-cluster
192.168.1.246 synology
192.168.1.X jgmelon # Private Docker registry
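After editing /etc/hosts, a quick resolution sanity check catches typos before the cluster stack depends on these names. A sketch:

```shell
# getent consults the same resolver order as the rest of the system
# (/etc/hosts first, then DNS), so it verifies what services will see.
check_hosts() {
    getent hosts "$@" || echo "one or more names did not resolve" >&2
}
# check_hosts thing-1 thing-2 thing-1-cluster thing-2-cluster synology jgmelon
```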
Mount backup location:
cd /mnt/ha-shared/thing-1-backup
Copy network configs:
# From thing-2 to thing-1
scp etc/hosts thing-1:/tmp/
scp -r etc/NetworkManager/system-connections thing-1:/tmp/
scp -r etc/corosync thing-1:/tmp/
scp -r etc/firewalld thing-1:/tmp/
scp -r etc/docker thing-1:/tmp/ # Docker config including insecure registries
Apply on thing-1:
# Restore configs
sudo cp /tmp/hosts /etc/hosts
sudo cp -r /tmp/system-connections/* /etc/NetworkManager/system-connections/
sudo chmod 600 /etc/NetworkManager/system-connections/*
sudo chown root:root /etc/NetworkManager/system-connections/*
# Restore corosync config
sudo cp -r /tmp/corosync /etc/
# Restore firewall config
sudo rsync -av /tmp/firewalld/ /etc/firewalld/
sudo firewall-cmd --reload
# Restore Docker config (if backed up)
sudo cp /tmp/docker/daemon.json /etc/docker/daemon.json
sudo systemctl restart docker
# Reload network
sudo nmcli connection reload
sudo systemctl restart NetworkManager
Copy systemd service files:
# From thing-2
scp etc/systemd/system/ha-*.service thing-1:/tmp/
# On thing-1
sudo cp /tmp/ha-*.service /etc/systemd/system/
sudo systemctl daemon-reload
If Navidrome reports "database disk image is malformed":
# Stop navidrome first
sudo pcs resource disable ha-navidrome
# Attempt WAL checkpoint (may fix minor corruption)
sudo sqlite3 /var/lib/navidrome/navidrome.db "PRAGMA wal_checkpoint(TRUNCATE);"
# Check integrity
sudo sqlite3 /var/lib/navidrome/navidrome.db "PRAGMA integrity_check;" 2>&1 | head -20
# If corrupted, attempt recovery to new database
sudo sqlite3 /var/lib/navidrome/navidrome.db ".recover" | sudo sqlite3 /tmp/navidrome_recovered.db
sudo sqlite3 /tmp/navidrome_recovered.db "PRAGMA integrity_check;"
Even heavily corrupted databases can often yield user data:
# Extract users from recovered database
sudo sqlite3 /tmp/navidrome_recovered.db "SELECT user_name, email FROM user;" 2>/dev/null
# Check tables available
sudo sqlite3 /tmp/navidrome_recovered.db ".tables"
# Get all libraries with paths
sudo sqlite3 /tmp/navidrome_recovered.db "SELECT id, name, path FROM library;" 2>/dev/null
# Count user_library entries
sudo sqlite3 /tmp/navidrome_recovered.db "SELECT COUNT(*) FROM user_library;" 2>/dev/null
# Stop navidrome (database must not be locked)
sudo pcs resource disable ha-navidrome
# Insert libraries from recovered database
sudo sqlite3 /tmp/navidrome_recovered.db ".dump library" | grep "^INSERT" | sudo sqlite3 /var/lib/navidrome/navidrome.db
# Clear existing user_library entries and reinsert
sudo sqlite3 /var/lib/navidrome/navidrome.db "DELETE FROM user_library;"
sudo sqlite3 /tmp/navidrome_recovered.db ".dump user_library" | grep "^INSERT" | sudo sqlite3 /var/lib/navidrome/navidrome.db
# Verify counts
sudo sqlite3 /var/lib/navidrome/navidrome.db "SELECT COUNT(*) FROM library;"
sudo sqlite3 /var/lib/navidrome/navidrome.db "SELECT COUNT(*) FROM user_library;"
sudo pcs resource enable ha-navidrome
If users were manually recreated, their IDs will differ from the recovered database. Fix user_library assignments:
# Compare IDs between databases
sudo sqlite3 /var/lib/navidrome/navidrome.db "SELECT id, user_name FROM user;"
sudo sqlite3 /tmp/navidrome_recovered.db "SELECT id, user_name FROM user;"
# Update mismatched user IDs in user_library
sudo pcs resource disable ha-navidrome
sudo sqlite3 /var/lib/navidrome/navidrome.db "
UPDATE user_library SET user_id='<new_id>' WHERE user_id='<old_id>';
"
sudo pcs resource enable ha-navidrome
# Manual backup (stop navidrome first for clean backup)
sudo pcs resource disable ha-navidrome
sudo sqlite3 /var/lib/navidrome/navidrome.db ".backup /mnt/ha-shared/navidrome/navidrome.db.backup-$(date +%Y%m%d-%H%M%S)"
sudo pcs resource enable ha-navidrome
Automated backup script is installed at /usr/local/bin/navidrome-backup.sh and runs nightly at 3am. It is DRBD-aware and only runs on the Primary node. Keeps 7 days of backups.
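For reference, a DRBD-aware script of that shape might look like the sketch below; the installed /usr/local/bin/navidrome-backup.sh is authoritative, and the DRBD resource name "r0" here is an assumption.

```shell
#!/bin/bash
# Sketch of a DRBD-aware nightly backup: act only on the Primary node,
# take a consistent sqlite .backup copy, keep 7 days of backups.

is_drbd_primary() {
    drbdadm role r0 2>/dev/null | grep -q 'Primary'
}

backup_db() {   # backup_db <db> <dest_dir>
    sqlite3 "$1" ".backup $2/navidrome.db.backup-$(date +%Y%m%d-%H%M%S)"
}

prune_old() {   # delete backups older than 7 days
    find "$1" -name 'navidrome.db.backup-*' -mtime +7 -delete
}

main() {
    is_drbd_primary || exit 0           # secondary node: do nothing
    backup_db /var/lib/navidrome/navidrome.db /mnt/ha-shared/navidrome
    prune_old /mnt/ha-shared/navidrome
}
# main   # invoked nightly at 03:00 by the installed schedule
```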
On new node (thing-1):
- container-selinux package installed
- ls -Z /usr/bin/dockerd → should be container_runtime_exec_t
- webnet network created (sudo docker network create webnet)

# 1. Network connectivity
ping thing-2
ping 10.0.0.22
ping 192.168.1.246
# 2. Docker ready
docker --version
docker network ls | grep webnet
docker info | grep -A 5 "Insecure Registries"
ls -Z /usr/bin/dockerd # Must show container_runtime_exec_t
# 3. Firewall configured
sudo firewall-cmd --list-all
# 4. Mount points exist
ls -ld /mnt/ha-shared /senuti
# 5. Services exist
ls /etc/systemd/system/ha-*.service
# 6. Cluster communication
sudo pcs status corosync
# 7. Can reach NFS
showmount -e 192.168.1.246
# 1. Put old active node in standby
sudo pcs node standby thing-2
# 2. Watch resources migrate
watch sudo pcs status
# 3. Verify services on new node
sudo docker ps
curl -I http://192.168.1.100
# 4. If issues, rollback
sudo pcs node standby thing-1
sudo pcs node unstandby thing-2
# 5. If successful, bring both online
sudo pcs node unstandby thing-2
sudo pcs node unstandby thing-1
| Port | Protocol | Service | Description |
|---|---|---|---|
| 2224 | TCP | pcsd | Cluster management daemon |
| 3121 | TCP | Pacemaker | Pacemaker remote |
| 5403 | TCP | Corosync | Quorum device (corosync-qnetd) |
| 5404-5405 | UDP | Corosync | Multicast communication |
| 21064 | TCP | DLM | Distributed Lock Manager |
| Port | Protocol | Service | Description |
|---|---|---|---|
| 80 | TCP | HTTP | Web traffic |
| 443 | TCP | HTTPS | Secure web traffic |
| 25 | TCP | SMTP | Mail transfer |
| 587 | TCP | Submission | Mail submission |
| 110 | TCP | POP3 | Mail retrieval |
| 995 | TCP | POP3S | Secure POP3 |
| 143 | TCP | IMAP | Mail access |
| 993 | TCP | IMAPS | Secure IMAP |
| 4190 | TCP | Sieve | Mail filtering |
| 3000 | TCP | Navidrome | Music streaming |
| 4533 | TCP | Navidrome | Navidrome web UI |
| 5201 | TCP | iperf3 | Network testing |
| 27018 | TCP | MongoDB | Database |
Solution: Create the network on both nodes:
sudo docker network create webnet
Solution: Configure insecure registry in /etc/docker/daemon.json:
{"insecure-registries": ["jgmelon:5002"]}
Then restart Docker: sudo systemctl restart docker
Solution: Manually build/pull first, or increase timeout:
sudo pcs resource update ha-web-containers op start timeout=300s
Solution: Remove old key:
ssh-keygen -R thing-1
Solution: Cleanup transient monitor failures:
sudo pcs resource cleanup
"permission denied while trying to connect to the Docker daemon socket" even with sudo

Solution: container-selinux is missing. Install it and run restorecon:
sudo dnf install -y container-selinux
sudo restorecon -v /usr/bin/dockerd
sudo systemctl restart docker
Cause: restorecon was run before container-selinux was installed, labeling dockerd as bin_t instead of container_runtime_exec_t.
Solution:
sudo dnf install -y container-selinux
sudo restorecon -v /usr/bin/dockerd
# Kill any stuck dockerd process
sudo kill $(pgrep dockerd)
sudo systemctl start docker
failed to create temp dir: stat /var/lib/containerd/.../snapshots: no such file or directory

Solution:
sudo mkdir -p /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots
sudo systemctl restart containerd
Cause: Usually disk full (WAL file can't be written) or ungraceful unmount.
Solution: See Navidrome Database Recovery section above.