heartbeat a divne chovani [delsi]

Zdenek SUTR Kaminski sutr na valachian-labs.com
Pátek Říjen 7 10:26:19 CEST 2005


Zdravim,

  rozhodl jsem se otestovat heartbeat a pozoruji divne chovani. Konfigurace 
je nasledujici:

ha.cf:
------
logfacility local0
keepalive 5
deadtime 20
warntime 7
initdead 26
serial /dev/ttyS0
auto_failback on
node    rra.firma.cz
node    rrb.firma.cz
debug 0

haresources:
------------
rrb.firma.cz IPaddr2::192.168.1.10 apache

authkeys:
---------
auth 2
2 crc

rra# rpm -qa | grep heartbeat
heartbeat-pils-2.0.2-1.zk.rhel4
heartbeat-2.0.2-1.zk.rhel4
heartbeat-stonith-2.0.2-1.zk.rhel4


Keepalive, deadtime a initdead mam nastavene nizko proto, protoze vypadky 
heartbeatu simuluji stopnutim/nastartovanim sluzby a at tedy nemusim 
pripadne cekat dlouho na hlasky v logu a migraci sluzeb... A ze nepozivam 
xml u heartbeat2? No pisou, ze stara konfigurace je mozna a platna.

A ted to divne chovani:

0. oba heartbeaty jsou stopnute
1. nastartuji RRB
  - a cekam 26 vt., nez prohlasi, ze RRA nezije
  - pak se nastartuje sluzba apache a take vidim, ze
    se na RRB prida ona ip adresa.
  - v logu RRB je zajimave toto:
Oct  7 09:58:28 rrb heartbeat: [9309]: info: Local status now set to: 'up'
...
Oct  7 09:58:54 rrb heartbeat: [9309]: WARN: node rra.firma.cz: is dead
Oct  7 09:58:54 rrb heartbeat: [9309]: info: Local status now set to: 'active'
....
Oct  7 09:59:04 rrb heartbeat: [9309]: info: Local Resource acquisition completed. (none)
Oct  7 09:59:04 rrb heartbeat: [9309]: info: local resource transition completed.

coz je v poradku.

2. startuji RRA
  - v logu RRB vidim:
Oct  7 10:03:04 rrb heartbeat: [9309]: info: Link rra.firma.cz:/dev/ttyS0 up.
Oct  7 10:03:04 rrb heartbeat: [9309]: info: Status update for node rra.firma.cz: status init
Oct  7 10:03:04 rrb heartbeat: [9572]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Oct  7 10:03:04 rrb harc[9572]: info: Running /etc/ha.d/rc.d/status status
Oct  7 10:03:04 rrb heartbeat: [9309]: info: Status update for node rra.firma.cz: status up
Oct  7 10:03:04 rrb heartbeat: [9582]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Oct  7 10:03:04 rrb harc[9582]: info: Running /etc/ha.d/rc.d/status status
Oct  7 10:03:04 rrb heartbeat: [9309]: info: Status update for node rra.firma.cz: status active
Oct  7 10:03:04 rrb heartbeat: [9592]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Oct  7 10:03:04 rrb harc[9592]: info: Running /etc/ha.d/rc.d/status status
Oct  7 10:03:05 rrb heartbeat: [9309]: info: remote resource transition completed.
Oct  7 10:03:05 rrb heartbeat: [9309]: info: rrb.firma.cz wants to go standby [foreign]
Oct  7 10:03:05 rrb heartbeat: [9309]: info: standby: rra.firma.cz can take our foreign resources
Oct  7 10:03:05 rrb heartbeat: [9602]: info: give up foreign HA resources (standby).
Oct  7 10:03:05 rrb heartbeat: [9602]: info: foreign HA resource release completed (standby).
Oct  7 10:03:05 rrb heartbeat: [9309]: info: Local standby process completed [foreign].
Oct  7 10:03:05 rrb heartbeat: [9309]: WARN: 1 lost packet(s) for [rra.firma.cz] [8:10]
Oct  7 10:03:05 rrb heartbeat: [9309]: info: remote resource transition completed.
Oct  7 10:03:05 rrb heartbeat: [9309]: info: No pkts missing from rra.firma.cz!
Oct  7 10:03:05 rrb heartbeat: [9309]: info: Other node completed standby takeover of foreign resources.


Proc mel mit RRA status 'active'? Nemel by byt jen 'up'?

  - v logu RRA vidim:
Oct  7 10:03:03 rra heartbeat: [11101]: info: Local status now set to: 'up'
Oct  7 10:03:04 rra heartbeat: [11104]: info: pid 11104 locked in memory.
Oct  7 10:03:04 rra heartbeat: [11106]: info: pid 11106 locked in memory.
Oct  7 10:03:04 rra heartbeat: [11105]: info: pid 11105 locked in memory.
Oct  7 10:03:04 rra heartbeat: [11101]: info: Link rrb.firma.cz:/dev/ttyS0 up.
Oct  7 10:03:04 rra heartbeat: [11101]: info: Status update for node rrb.firma.cz: status active
Oct  7 10:03:04 rra heartbeat: [11101]: info: Local status now set to:'active'
Oct  7 10:03:04 rra heartbeat: [11107]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Oct  7 10:03:04 rra harc[11107]: info: Running /etc/ha.d/rc.d/status status
Oct  7 10:03:04 rra heartbeat: [11101]: info: remote resource transition completed.
Oct  7 10:03:04 rra heartbeat: [11101]: info: remote resource transition completed.
Oct  7 10:03:04 rra heartbeat: [11101]: info: Local Resource acquisition completed. (none)
Oct  7 10:03:05 rra heartbeat: [11101]: info: rrb.firma.cz wants to go standby [foreign]
Oct  7 10:03:05 rra heartbeat: [11101]: info: standby: acquire [foreign] resources from rrb.firma.cz
Oct  7 10:03:05 rra heartbeat: [11120]: info: acquire local HA resources (standby).
Oct  7 10:03:05 rra heartbeat: [11120]: info: local HA resource acquisition completed (standby).
Oct  7 10:03:05 rra heartbeat: [11101]: info: Standby resource acquisition done [foreign].
Oct  7 10:03:05 rra heartbeat: [11101]: info: Initial resource acquisition complete (auto_failback)
Oct  7 10:03:06 rra heartbeat: [11101]: info: remote resource transition completed.

Zda se, ze RRA si mysli, ze je 'active', no neprebira zadnou IP adresu, 
ani nestartuje apache. Coz bych jeste chapal. Takze asi spatne chapu pojem 
'active'. Nu, pojdme dale.

3. stopnu RRA
  - a ocekaval bych, ze na RRB nebude nic delat, jen prohlasi, ze RRA umira. 
Nicmene v logu RRB vidim:

Oct  7 10:11:42 rrb heartbeat: [9309]: info: Received shutdown notice from 'rra.firma.cz'.
Oct  7 10:11:42 rrb heartbeat: [9309]: info: Resources being acquired from rra.firma.cz.
Oct  7 10:11:42 rrb heartbeat: [9309]: debug: StartNextRemoteRscReq(): child count 1
Oct  7 10:11:42 rrb heartbeat: [9623]: info: acquire local HA resources (standby).
Oct  7 10:11:42 rrb ResourceManager[9644]: info: Acquiring resource group: rrb.firma.cz IPaddr2::192.168.1.150 nothing
Oct  7 10:11:42 rrb heartbeat: [9624]: info: Local Resource acquisition completed.
Oct  7 10:11:42 rrb heartbeat: [9309]: debug: StartNextRemoteRscReq(): child count 1
Oct  7 10:11:42 rrb ResourceManager[9644]: info: Running /etc/init.d/apache  start
Oct  7 10:11:42 rrb ResourceManager[9644]: debug: Starting /etc/init.d/apache  start
Oct  7 10:11:42 rrb ResourceManager[9644]: debug: /etc/init.d/apache start done. RC=0
Oct  7 10:11:42 rrb heartbeat: [9623]: info: local HA resource acquisition completed (standby).
Oct  7 10:11:42 rrb heartbeat: [9309]: info: Standby resource acquisition done [foreign].
Oct  7 10:11:42 rrb heartbeat: [9766]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Oct  7 10:11:42 rrb harc[9766]: info: Running /etc/ha.d/rc.d/status status
Oct  7 10:11:42 rrb mach_down[9776]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Oct  7 10:11:42 rrb heartbeat: [9309]: info: mach_down takeover complete.
Oct  7 10:11:42 rrb mach_down[9776]: info: mach_down takeover complete for node rra.firma.cz.
Oct  7 10:12:04 rrb heartbeat: [9309]: WARN: node rra.firma.cz: is dead
Oct  7 10:12:04 rrb heartbeat: [9309]: info: Dead node rra.firma.cz gave up resources.
Oct  7 10:12:04 rrb heartbeat: [9309]: info: Link rra.firma.cz:/dev/ttyS0 dead.
Oct  7 10:12:04 rrb heartbeat: [9313]: WARN: glib: TTY write timeout on [/dev/ttyS0]

Proc se sakra RRB znova pokousi startovat apache?

A v logu RRA vidim:

Oct  7 10:11:41 rra heartbeat: [11101]: info: Heartbeat shutdown in progress. (11101)
Oct  7 10:11:41 rra heartbeat: [12895]: info: Giving up all HA resources.
Oct  7 10:11:41 rra ResourceManager[12905]: info: Releasing resource group: rrb.firma.cz IPaddr2::192.168.1.150 apache
Oct  7 10:11:41 rra ResourceManager[12905]: info: Running /etc/init.d/apache  stop
Oct  7 10:11:41 rra ResourceManager[12905]: debug: Starting /etc/init.d/apache  stop
Oct  7 10:11:41 rra ResourceManager[12905]: debug: /etc/init.d/apache stop done. RC=0
Oct  7 10:11:41 rra ResourceManager[12905]: info: Running /etc/ha.d/resource.d/IPaddr2 192.168.1.150 stop
Oct  7 10:11:41 rra ResourceManager[12905]: debug: Starting/etc/ha.d/resource.d/IPaddr2 192.168.1.150 stop
Oct  7 10:11:41 rra ResourceManager[12905]: debug: /etc/ha.d/resource.d/IPaddr2 192.168.1.150 stop done. RC=0
Oct  7 10:11:41 rra heartbeat: [12895]: info: All HA resources relinquished.
Oct  7 10:11:42 rra heartbeat: [11101]: WARN: 1 lost packet(s) for [rrb.firma.cz] [177:179]
Oct  7 10:11:42 rra heartbeat: [11101]: info: No pkts missing from rrb.firma.cz!
Oct  7 10:11:43 rra heartbeat: [11101]: info: killing HBFIFO process 11104 with signal 15
Oct  7 10:11:43 rra heartbeat: [11101]: info: killing HBWRITE process 11105 with signal 15
Oct  7 10:11:43 rra heartbeat: [11101]: info: killing HBREAD process 11106 with signal 15
Oct  7 10:11:43 rra heartbeat: [11101]: info: Core process 11104 exited. 3 remaining
Oct  7 10:11:43 rra heartbeat: [11101]: info: Core process 11106 exited. 2 remaining
Oct  7 10:11:43 rra heartbeat: [11101]: info: Core process 11105 exited. 1 remaining
Oct  7 10:11:43 rra heartbeat: [11101]: info: Heartbeat shutdown complete.

Proc se sakra snazi zbavit ip adresy, kterou nikdy nemel, proc stopuje 
sluzby, ktere nikdy nestartoval? Protoze si mysli, ze je active? Toz uz 
jsem se ptal u bodu 2 :-(


Jestli mi takove chovani vadi? Ano vadi, protoze RRB znova startuje to, co 
uz nastartovane ma a ne vsechny skripty jsou tak dobre (ve smyslu vhodne 
pro heartbeat) udelane, ze zahlasi, ze sluzba jiz bezi a pritom vrati 0. A 
ja opravdu nechci vetsinu skritpu prepisovat a kontrolovat. A vsechno si 
to psat do jednoho sveho, ktery mam osetreny, ze mi take nelibi. Spise 
myslim, ze mam nekde chybu, jen nevim kde :-(

Dela mi to i ve starsich verzi heartbeatu, takze jim to snad byt nemuze. 
Vzdyt by to byl soft na prd, ze jo?

Toz jestli nekoho neco kloudneho napadne, toz budu vdecny.

Z.K.
--
Wallachian Laboratories? Freeride in UN*X systems...


Další informace o konferenci Linux