NETASQ Online Training Session 8
High Availability
© NETASQ 2008
Summary
• Global functioning
• Heartbeat
• Control connection
• State replication
• Synchronisation
• Update process
• Diagnosis
• The output of hasstatus -s
• Logs analysis
• Known problems and tips
NETASQ – CORPORATE PRESENTATION
Global functioning 1/2
• Active/passive high-availability system
• Two roles: master and slave, defined by licence
• Two states: active and passive
• Heartbeat (ping) used to detect a dead peer (triggers HA swap)
• Control connection used to carry state replication and HA information
• MAC addresses are replicated through the MACAddress token in ~/ConfigFiles/network
• The HA interface's MAC address is not replicated (the MACAddress token is ignored when the interface's name is HA)
Global functioning 2/2
• Gratuitous ARP:
  – Used to keep link-level equipment (e.g., switches) updated
  – ARP is-at packets sent periodically and on swap
  – Handled by arpreset through eventd
• Quality (triggers HA swap):
  – Percentage of interfaces that are up and running
  – Only counts enabled interfaces (including VLANs)
  – Doesn't include HA interface(s)
  – Uses the carrier detection mechanism (ifconfig's status field)
  – When both firewalls are active and have the same quality, gives the active state to the master and the passive state to the slave
• Priority: when both firewalls have the same quality, gives the active state to the selected one (e.g., master) and the passive state to the other one.
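The quality and priority rules above can be sketched as follows. This is a minimal illustration, not NETASQ's implementation: the `Interface` record, the `quality` and `elect_active` functions, and the interface names used with them are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    name: str
    enabled: bool
    carrier_up: bool  # carrier detected (ifconfig's status field)
    is_ha: bool       # HA interfaces are excluded from the quality


def quality(interfaces):
    """Percentage of enabled, non-HA interfaces that are up and running."""
    counted = [i for i in interfaces if i.enabled and not i.is_ha]
    if not counted:
        return 0  # assumption: nothing to count means quality 0
    up = sum(1 for i in counted if i.carrier_up)
    return 100 * up // len(counted)


def elect_active(local_quality, peer_quality, local_has_priority):
    """Return True if the local firewall should hold the active state."""
    if local_quality != peer_quality:
        return local_quality > peer_quality  # better quality wins
    # Same quality: the firewall selected by priority (e.g., master) wins.
    return local_has_priority
```

With two enabled interfaces of which one has lost its carrier, `quality` returns 50; a peer reporting 100 would then take the active state regardless of priority.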
Heartbeat 1/2
• Based on three parameters:
  – Period, delay in seconds between heartbeats
  – FailoverPeriod, maximum delay in seconds to wait for a reply
  – Limit, maximum number of failures before swapping
• Maximum delay before swapping is Period + Limit × FailoverPeriod
• Handled by hacheckstatus through eventd (see /var/tmp/eventd.rules)
• Takes priority over the control connection state
• Used on both main and backup links
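The maximum delay formula above can be checked with a short calculation. The function name and the sample values are illustrative only; they are not NETASQ defaults.

```python
def max_swap_delay(period, failover_period, limit):
    """Worst-case delay in seconds before an HA swap is triggered.

    Implements: Period + Limit x FailoverPeriod
    """
    return period + limit * failover_period


# Illustrative values: heartbeat every 2 s, 1 s reply timeout, 3 failures allowed.
print(max_swap_delay(period=2, failover_period=1, limit=3))  # prints 5
```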
Heartbeat 2/2
• Uses ICMP Echo messages (ping)
Control connection
• Handled by serverd on 1300/tcp
• Used only on the main HA link
• Carries HA information on both firewalls:
  – Role, master or slave
  – State, active or passive
  – Quality, percentage of interfaces that are up and running
  – Configuration synchronisation state
  – Number of heartbeat (ping) failures
• On boot, if the control connection can't be established, the firewall becomes active
• While running, if the control connection is lost, the firewall keeps its current state until heartbeat failure
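The two rules for a missing control connection can be sketched as a small decision function. This is one reading of the slides, not the actual serverd logic: the `State` enum, the function name and its parameters are hypothetical, and the "go active on heartbeat failure" branch is taken from the log messages described in the Logs analysis slides.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


def state_without_control(current, booting, heartbeat_threshold_reached):
    """State to hold while the control connection is down."""
    if booting:
        # On boot, no control connection: become active.
        return State.ACTIVE
    if heartbeat_threshold_reached:
        # Peer considered dead: a passive firewall takes over,
        # an active one stays active.
        return State.ACTIVE
    # While running: keep the current state until heartbeat failure.
    return current
```

The point of the third branch is that losing the control connection alone never triggers a swap; only the heartbeat does.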
State replication 1/2
• Done through the control connection
• Only on the main link
• What is replicated:
  – Connections (sfctl -s conn)
  – Hosts (sfctl -s host)
• What is not replicated:
  – Plugin attachment and state
  – NAT sessions (ipnat -l)
  – The SA database (showSAD)
  – Load-balancing state (sfctl -s route)
  – Data used to rewrite TCP sequence numbers
• Connections get the RECOVERY state after a swap and return to the DATA state once some data has been transmitted
State replication 2/2
• Can be disabled by setting ForwardASQ to 0 in the Global section of ~/ConfigFiles/highavailability
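As an illustration, the fragment below shows what such a setting might look like and how it could be read. It assumes the file uses an INI-style layout with a `[Global]` section; the exact file syntax is not shown in this session, and the `replication_enabled` helper is hypothetical.

```python
import configparser

# Assumed layout of the relevant part of ~/ConfigFiles/highavailability.
SAMPLE = """\
[Global]
ForwardASQ=0
"""


def replication_enabled(text):
    """Return False when ForwardASQ is set to 0 in the Global section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep token names case-sensitive
    parser.read_string(text)
    # Assumption: a missing token means replication stays enabled.
    return parser.get("Global", "ForwardASQ", fallback="1") != "0"


print(replication_enabled(SAMPLE))  # prints False
```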
Synchronisation
• Done through serverd's ha sync command
• Can synchronise:
  – Configuration (temporarily stored in /tmp/hasync.tgz)
  – ClamAV database (mode=clamav)
  – Kaspersky database (mode=kaspersky)
  – NETASQ URL groups (mode=urlfiltering)
  – Optenet URL groups (mode=optenet)
  – ASQ contextual signatures (mode=patterns)
  – Anti-spam data (mode=antispam)
  – Seismo data (mode=pvm)
• Synchronisation of data (i.e., the mode= variants) is run by autoupdate when the control connection is up
Update process
• Default: the active firewall is updated
• Passive update:
  – The passive firewall is updated through the active one
  – Uses the control connection to send the new firmware
  – Can only upgrade from the last minor release of the current version to the next major version
When 7.0.0 was released, the latest 6.3 release was 6.3.3, so passive update from 6.3.0, 6.3.1 and 6.3.2 is not possible. Passive update to 7.0.x must therefore be done from 6.3.3 or 6.3.4.
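The version constraint in the example above can be expressed as a simple comparison. This is a sketch of the rule as stated on the slide, not the firmware's own check; `parse_version` and `can_passive_update` are hypothetical helpers, and the minimum source release (6.3.3) is the one quoted above.

```python
def parse_version(text):
    """Turn '6.3.3' into (6, 3, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def can_passive_update(current, minimum_source):
    """True if `current` is in the same branch as `minimum_source`
    and at least as recent."""
    cur, req = parse_version(current), parse_version(minimum_source)
    return cur[:2] == req[:2] and cur >= req


# Passive update to 7.0.x requires 6.3.3 or later in the 6.3 branch.
for version in ("6.3.0", "6.3.1", "6.3.2", "6.3.3", "6.3.4"):
    print(version, can_passive_update(version, "6.3.3"))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "6.3.10" sorting before "6.3.3".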
Diagnosis 1/2
• System logs (l_system) show:
  – state changes
  – heartbeat failures
• Alarms (l_alarm)
• System logs can be logged as alarms (see Manager's Logs panel)
• Use hasstatus -s to know the current HA status
• Using netstat -anp tcp | grep 1300, you must see two ESTABLISHED connections in the HA network
• Using netstat -anp icmp, you must see the following:
  icm4  0  0  *.*  *.*
Diagnosis 2/2
• You should see four streams on the main HA link:
  – Ping from master to slave
  – Ping from slave to master
  – 1300/tcp connection from master to slave
  – 1300/tcp connection from slave to master
• You should see two streams on the backup HA link:
  – Ping from master to slave
  – Ping from slave to master
The output of hasstatus -s (1/4)
• Global data
Global:
-------
pass= adminadmin
priority= 0
need_register= 0
peer_backup_active= N/A
peer_backup_date= N/A
heartbeat_state= 0
peer_error= 0
lock_state= 0
need_change_state= 0
peer_backup_version= N/A
send_peer_failure= 4
The output of hasstatus -s (2/4)
• pass: password protecting HA
• priority: does the firewall have priority? 0 (no), 4 (slave), or 8 (master)
• need_register: does the firewall need to send the REGISTER command? 0 (no) or 1 (yes)
• peer_backup_active: active partition on the peer (main or backup)
• peer_backup_version: firmware version of the peer's backup partition
• peer_backup_date: date of the peer's last backup
• heartbeat_state: heartbeat state
• peer_error: state of the control connection: 0 (connection is OK) or 1 (error)
• lock_state: HA transition state, i.e., the firewall is changing state (active/passive)
• need_change_state: unused
• send_peer_failure: used to avoid repeatedly logging peer failure
The output of hasstatus -s (3/4)
• Local and peer's data
Local:
------
slicence= Master
licence= 8
sstate= Active
state= 2
quality= 100
synced= 1
ha= 1
serial= F50-EE046190600601
version= 7.0.2
link_time= 10d 16h 19m 40s
last_link= 0d 00h 00m 00s
ping_failure=
asqdump= 2

Peer:
-----
slicence= Slave
licence= 4
sstate= Passive
state= 1
quality= 100
synced= 1
ha= 1
serial= F50-EE575270700606
version= 7.0.2
link_time= 10d 16h 14m 30s
last_link= 35d 06h 10m 00s
ping_failure=
asqdump= 2
The output of hasstatus -s (4/4)
• slicence/licence: HA licence (master or slave)
• sstate/state: HA state (active or passive)
• quality: percentage of interfaces that are up and running
• synced: is the configuration in sync with the peer? (1:1 → OK)
• ha: is HA enabled? Set to 0 while rebooting, for example
• serial: firewall's serial number
• version: firewall's firmware version
• link_time: duration of the control connection (serverd)
• last_link: duration of the previous control connection
• ping_failure: date of last heartbeat failure
• asqdump: version of the connection table
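When checking many firewalls, the fields described above can be extracted with a small parser. The sketch below assumes the output has one `key= value` pair per line under `Section:` headers with dashed underlines; the real layout may differ, and `parse_hasstatus` is a hypothetical helper. The sample reuses a subset of the values shown in the previous slide.

```python
SAMPLE = """\
Local:
------
slicence= Master
sstate= Active
quality= 100
synced= 1
ping_failure=
Peer:
-----
slicence= Slave
sstate= Passive
quality= 100
synced= 1
ping_failure=
"""


def parse_hasstatus(text):
    """Parse 'key= value' lines grouped under 'Section:' headers."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue  # skip blank lines and dashed underlines
        if line.endswith(":") and "=" not in line:
            current = sections.setdefault(line[:-1], {})
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return sections


status = parse_hasstatus(SAMPLE)
# Configuration is in sync when both sides report synced= 1 (1:1).
in_sync = all(side.get("synced") == "1" for side in status.values())
print(status["Local"]["sstate"], status["Peer"]["sstate"], in_sync)
```

Values are kept as strings on purpose, since fields such as link_time and serial are not numeric.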
Logs analysis (1/4)
• Same state, become ACTIVE [quality]
  Both firewalls have the same state; the local peer becomes active due to its better quality
• Same state, become ACTIVE [licence]
  Both firewalls have the same state; the local peer becomes active because it is master according to its licence
• Main link is lost
  Heartbeat failure threshold has been reached on the main HA link
• Backup link is lost
  Heartbeat failure threshold has been reached on the backup HA link
• Main link is recovered
  Heartbeat is back on the main HA link
• Backup link is recovered
  Heartbeat is back on the backup HA link
• Connection with passive firewall is lost
  The control connection (serverd) with the passive firewall is lost
• Connection with passive firewall recovered
  The control connection (serverd) with the passive firewall is back
Logs analysis (2/4)
• Peer is UNREACHABLE via Main link
  Heartbeat failure threshold has been reached on the main link during unit startup
• Peer is UNREACHABLE via Backup link
  Heartbeat failure threshold has been reached on the backup link during unit startup
• Peer not responding, starting in active mode
  Heartbeat failure threshold has been reached on both links during unit startup; the local peer is going active
• Peer is active, starting in passive mode
  The control connection (serverd) and heartbeats are up during unit startup; the remote peer is active, so the local peer is going passive
• Swap state, become ACTIVE [quality]
  Swap triggered by HA quality; the local peer is going active
• Swap state, become ACTIVE [priority]
  Swap triggered by HA priority; the local peer is going active
Logs analysis (3/4)
• Active firewall is in locked state
  The active firewall prevents HA swap (reboot or autoupdate in progress)
• ASQ Engine in Passive mode
  The local peer goes passive
• ASQ Engine in Active mode
  The local peer goes active
• Start HA peer firewall monitoring
  hasstatus starts getting information from the remote peer through the control connection (serverd)
• Stop HA peer firewall monitoring
  hasstatus stops getting information from the remote peer through the control connection (serverd)
• Start HA monitoring locally
  hasstatus starts getting information from the local peer through serverd
• Stop HA monitoring locally
  hasstatus stops getting information from the local peer through serverd
Logs analysis (4/4)
• Heartbeat failure occurred with peer firewall
  The peer did not reply to the last heartbeat (ICMP Echo request)
• Peer firewall is online
  The remote peer is back after a reboot and the local peer gets this information through the control connection (serverd)
• Active failed to respond, want become ACTIVE
  Heartbeat failure threshold has been reached on both HA links; the local peer is passive and wants to go active
• Passive failed to respond
  Heartbeat failure threshold has been reached on both HA links; the local peer is active
Known problems and tips 1/2
• After restoring a configuration, you must run the HA wizard
• Manually rebooting the active firewall (e.g., serverd's system reboot, shell's reboot) doesn't lead to an HA swap
• You can get a shell on the passive firewall by allowing SSH connections on the HA link (with 7.0, you just have to enable the implicit SSH filter)
• You can use NAT (rdr) to get Manager access to the passive firewall
• Synchronisation and switch-over buttons are greyed out in the Manager if the two firewalls are not running the same firmware release (the operations are not available from the console either)
Known problems and tips 2/2
• Quality information isn't exchanged on the backup HA link: if you lose the main HA link but not the backup one, no swap will occur