Two-node high-availability cluster using Corosync/Pacemaker with dual-ring redundant heartbeat configuration.
Cluster Name: mycluster
Transport: knet (Kronosnet)
Nodes: thing-1 (10.0.0.21), thing-2 (10.0.0.22)
Encryption: AES-256 with SHA-256 hashing
thing-1                                    thing-2
├─ enp2s0f0 (10G SFP+)                     ├─ enp2s0f0 (10G SFP+)
│  ├─ IP: 192.168.1.21                     │  ├─ IP: 192.168.1.22
│  ├─ MTU: 9000                            │  ├─ MTU: 9000
│  └─ → UniFi Switch                       │  └─ → UniFi Switch
│                                          │
├─ enp2s0f1 (10G SFP+)                     ├─ enp2s0f1 (10G SFP+)
│  ├─ IP: 10.0.0.21                        │  ├─ IP: 10.0.0.22
│  ├─ MTU: 9000                            │  ├─ MTU: 9000
│  └─ → Direct Crossover ←────────────────→│  └─ → Direct Crossover
Rationale: Ring 0 on direct crossover ensures cluster communication survives switch failure, preventing split-brain scenarios. Even if the switch fails and production services are unavailable, the cluster maintains coordination and prevents dual-primary conflicts.
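This topology maps directly onto a `pcs cluster setup` invocation; a sketch assuming pcs 0.10+ (the first `addr=` per node becomes knet link 0 / ring 0, the second becomes link 1 / ring 1):

```shell
# Authenticate the nodes, then create the cluster with two knet links.
# Listing the crossover address first makes it ring 0 (primary heartbeat).
sudo pcs host auth thing-1 thing-2 -u hacluster
sudo pcs cluster setup mycluster \
  thing-1 addr=10.0.0.21 addr=192.168.1.21 \
  thing-2 addr=10.0.0.22 addr=192.168.1.22 \
  transport knet crypto cipher=aes256 hash=sha256
```

These commands require a live cluster and root access, so they are shown as a sketch rather than a verified transcript; the resulting corosync.conf should match the one documented below.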
Connection Name: 10g-external
IP Address: 192.168.1.21/24
Gateway: 192.168.1.1
MTU: 9000
Metric: 103
Role: Ring 1 (backup heartbeat), production traffic
Connection Name: 10g-interconnect
IP Address: 10.0.0.21/24
MTU: 9000
Metric: 102 (lower metric = higher priority)
Role: Ring 0 (primary heartbeat)
Connection Name: 10g-external
IP Address: 192.168.1.22/24
Gateway: 192.168.1.1
MTU: 9000
Metric: 103
Role: Ring 1 (backup heartbeat), production traffic
Connection Name: 10g-interconnect
IP Address: 10.0.0.22/24
MTU: 9000
Metric: 102 (lower metric = higher priority)
Role: Ring 0 (primary heartbeat)
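The connection profiles above can be created with nmcli; a sketch for thing-1 (substitute the .22 addresses on thing-2), using the interface names, metrics, and MTU from the tables above:

```shell
# Ring 1 / production path via the switch (carries the default gateway)
sudo nmcli con add type ethernet con-name 10g-external ifname enp2s0f0 \
  ipv4.method manual ipv4.addresses 192.168.1.21/24 ipv4.gateway 192.168.1.1 \
  ipv4.route-metric 103 802-3-ethernet.mtu 9000

# Ring 0 / primary heartbeat on the direct crossover (no gateway on purpose)
sudo nmcli con add type ethernet con-name 10g-interconnect ifname enp2s0f1 \
  ipv4.method manual ipv4.addresses 10.0.0.21/24 \
  ipv4.route-metric 102 802-3-ethernet.mtu 9000
```

Omitting the gateway on the interconnect keeps the crossover link point-to-point and prevents it from ever being selected as a default route.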
Location: /etc/corosync/corosync.conf
totem {
version: 2
cluster_name: mycluster
transport: knet
crypto_cipher: aes256
crypto_hash: sha256
cluster_uuid: 70ea570decca4757b47cc4ddda613e51
}
nodelist {
node {
ring0_addr: 10.0.0.21
ring1_addr: 192.168.1.21
name: thing-1
nodeid: 1
}
node {
ring0_addr: 10.0.0.22
ring1_addr: 192.168.1.22
name: thing-2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
timestamp: on
}
totem section:
- transport: knet - Kronosnet transport (supports multiple links)
- crypto_cipher: aes256 - Encryption for cluster communication
- crypto_hash: sha256 - Message authentication

nodelist section:
- ring0_addr - Primary heartbeat IP (direct crossover)
- ring1_addr - Secondary heartbeat IP (via switch)

quorum section:
- two_node: 1 - Enables two-node cluster operation without requiring majority

# List all interfaces with IPs
ip a
# Check specific interface link status
ethtool enp2s0f0
ethtool enp2s0f1
# Expected output:
# Speed: 10000Mb/s
# Duplex: Full
# Link detected: yes
# View all ring addresses
sudo corosync-cmapctl | grep ring
# Expected output:
# nodelist.node.0.ring0_addr (str) = 10.0.0.21
# nodelist.node.0.ring1_addr (str) = 192.168.1.21
# nodelist.node.1.ring0_addr (str) = 10.0.0.22
# nodelist.node.1.ring1_addr (str) = 192.168.1.22
# View ring status and health
sudo corosync-cfgtool -s
# Expected: Both rings showing as operational
# Install iperf3 if needed
sudo apt install iperf3
# On one node (server mode):
iperf3 -s
# On other node (client mode):
# Test direct crossover (ring0)
iperf3 -c 10.0.0.21 -t 30
# Test via switch (ring1)
iperf3 -c 192.168.1.21 -t 30
# Expected: ~9.9 Gbps sustained on both paths
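Throughput alone does not prove jumbo frames work end to end; a don't-fragment ping with an 8972-byte payload (8972 + 20 IP + 8 ICMP = 9000) checks the MTU on each path:

```shell
# Run from thing-2. Fails with "message too long" if any
# hop on the path is still limited to a 1500-byte MTU.
ping -M do -s 8972 -c 3 10.0.0.21     # ring0, direct crossover
ping -M do -s 8972 -c 3 192.168.1.21  # ring1, via switch
```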
# View cluster members
sudo corosync-cmapctl | grep members
# View quorum status
sudo corosync-quorumtool
# View detailed cluster status
sudo pcs status
# Watch heartbeat traffic on ring0
sudo tcpdump -i enp2s0f1 port 5405
# Watch heartbeat traffic on ring1
sudo tcpdump -i enp2s0f0 port 5405
The dual-ring configuration protects against split-brain scenarios through redundant communication paths:
Key Point: Even if the switch fails and makes production services unavailable, the direct crossover keeps the cluster nodes talking to each other, preventing both nodes from independently trying to become primary and causing data corruption.
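This claim can be exercised with a controlled test; a sketch that simulates a switch failure by downing the switch-facing link on one node (keep console access in case the node becomes unreachable over the network):

```shell
# Simulate switch failure: drop the ring1 (switch-facing) link
sudo ip link set enp2s0f0 down

# Ring 1 should report faulty while ring 0 stays connected
sudo corosync-cfgtool -s

# Quorum should be retained via the crossover
sudo corosync-quorumtool

# Restore the link when done
sudo ip link set enp2s0f0 up
```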
Symptoms: Excessive packet loss on ring0
Check:
iperf3 -c 10.0.0.21 -t 60
# Look for high "Retr" count
Solutions:
Symptoms: Cluster shows quorum lost/regained in logs
Check ring health:
# Check for errors
sudo corosync-cfgtool -s
# Monitor logs
sudo journalctl -u corosync -f
Solutions:
Symptoms: Long delay before failover completes
Check configuration:
sudo corosync-cmapctl | grep token
Solutions:
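If slow failover traces back to the token timeout, it can be tuned in the totem section of corosync.conf; a sketch (the 5000 ms value here is an illustrative assumption, not the cluster's current setting - see corosync.conf(5) for the defaults and the per-node coefficient):

```
totem {
    ...
    # Milliseconds without the token before a node is declared dead.
    # Lower = faster failover detection, but more sensitive to jitter.
    token: 5000
}
```

Apply the change on both nodes and follow the rolling-restart procedure below.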
Check detailed ring status:
sudo corosync-cfgtool -s
If one ring down:
ip link show
ip addr show
ping -I [interface] [peer-ip]
journalctl -u corosync -n 100

After editing /etc/corosync/corosync.conf:
# On both nodes:
sudo systemctl restart corosync
sudo systemctl restart pacemaker
# Verify cluster reformed:
sudo pcs status
sudo corosync-quorumtool
Note: Never restart corosync on both nodes simultaneously. Restart one, verify it rejoins, then restart the other.
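The one-node-at-a-time rule can be made defensive in a small script; a sketch that restarts the local node's services and waits for quorum before declaring it safe to touch the peer (assumes corosync-quorumtool exits non-zero when the node is not quorate):

```shell
#!/bin/sh
# Rolling restart: run on ONE node, wait for rejoin, then repeat on the peer.
sudo systemctl restart corosync pacemaker

# Poll for up to ~60 s for this node to rejoin and regain quorum
for i in $(seq 1 12); do
    if sudo corosync-quorumtool -s >/dev/null 2>&1; then
        echo "Node rejoined and quorate. Safe to restart the peer."
        exit 0
    fi
    sleep 5
done
echo "Quorum not regained - investigate before restarting the peer." >&2
exit 1
```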
Direct Crossover (Ring 0):
Via Switch (Ring 1):
Synology NFS (via Ring 1):
Conclusion: Single 10G link has 2-3x headroom for current workload. Network is not the bottleneck - storage RAID is.
Configuration:
Each node:
├─ 2x 10G SFP+ bonded (LACP) → switch
│ └─ Production traffic + NFS
└─ 1x 1G RJ45 direct crossover → other node
└─ Dedicated heartbeat only
Benefits:
When to Consider:
Configuration:
Each node:
├─ 2x 10G SFP+ bonded (LACP) → switch
└─ 2x 1G RJ45 bonded → other node
└─ Redundant heartbeat paths
Benefits:
Tradeoff: Probably overkill - even a single 1G link provides far more bandwidth than heartbeat traffic requires
Model: Intel X520-DA2
Chipset: Intel 82599ES
Ports: 2x 10G SFP+
Driver: ixgbe (in-kernel)
Features:
Direct Crossover: 10G SFP+ DAC (Direct Attach Copper)
To Switch: 10G SFP+ DAC or fiber (model dependent)
Model: UniFi Switch (10G SFP+ capable)
Configuration:
Last Updated: January 13, 2026
Maintained By: Josh G
Cluster Version: Corosync 3.x, Pacemaker 2.x