Problem with RAID1 via mdadm

al.linux at bcpraha.com
Thu Nov 15 16:20:07 CET 2007


Hi,

we have a server running open-source Xen 3.1.0 on Debian Etch, with a few
(4) virtualized instances of W2K3 Server x64.

Something odd happened to us: one of the virtual server instances crashed,
apparently because of a read error in the RAID subsystem. (Strange, since
even an error on one of the disks in a RAID1 should in theory not cause
this, but who knows. Fortunately that instance was not critical, so after
xm destroy and xm create it came back up with relatively little trouble.)

What worries me more is that I don't know why mdadm reported errors that
nevertheless did not lead to the affected volume being marked as faulty.
We later found the following messages in kern.log:

Nov 15 14:14:19 omega kernel: sd 0:0:1:0: SCSI error: return code = 0x08000002
Nov 15 14:14:19 omega kernel: sdb: Current: sense key: Medium Error
Nov 15 14:14:19 omega kernel:     Additional sense: Unrecovered read error
Nov 15 14:14:19 omega kernel: Info fld=0x12832f4d
Nov 15 14:14:19 omega kernel: end_request: I/O error, dev sdb, sector 310587213
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394432
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394440
Nov 15 14:14:24 omega kernel: raid1: sda2: redirecting sector 310394432 to another mirror
Nov 15 14:14:28 omega kernel: raid1: sda2: redirecting sector 310394440 to another mirror
Nov 15 14:14:28 omega kernel: qemu-dm[6305]: segfault at 0000000000000000 rip 0000000000000000 rsp 0000000041000ca8 error 14
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state
Nov 15 14:14:28 omega kernel: device tap0 left promiscuous mode
Nov 15 14:14:28 omega kernel: audit(1195132468.260:16): dev=tap0 prom=0 old_prom=256 auid=4294967295
Nov 15 14:14:28 omega kernel: xenbr0: port 4(tap0) entering disabled state
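
The sector numbers in the log above can be cross-checked with a few lines
of Python. This is only a sketch: the helper name parse_io_errors is made
up for illustration, and reading the difference between the two sector
numbers as "start of sdb2 on the disk plus the position inside the failed
request" is an assumption, since the partition table is not shown here.

```python
import re

# Excerpt of the kernel messages quoted above (wrapped lines joined).
LOG = """\
Nov 15 14:14:19 omega kernel: Info fld=0x12832f4d
Nov 15 14:14:19 omega kernel: end_request: I/O error, dev sdb, sector 310587213
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394432
Nov 15 14:14:19 omega kernel: raid1: sdb2: rescheduling sector 310394440
"""


def parse_io_errors(text):
    """Return (info_fields, failed_sectors, rescheduled_sectors) as ints.

    Hypothetical helper: it only knows the three message shapes above.
    """
    info = [int(m, 16) for m in re.findall(r"Info fld=(0x[0-9a-fA-F]+)", text)]
    failed = [int(m) for m in re.findall(r"I/O error, dev \w+, sector (\d+)", text)]
    resched = [int(m) for m in re.findall(r"rescheduling sector (\d+)", text)]
    return info, failed, resched


info, failed, resched = parse_io_errors(LOG)

# The SCSI sense "Info fld" and the end_request sector are the same number.
print(info[0] == failed[0])      # True

# Whole-disk sector (sdb) minus partition-relative sector (sdb2).
print(failed[0] - resched[0])    # 192781
```

So the drive and the block layer agree on a single unreadable sector, and
the two "rescheduling" lines are the 8-sector requests around it that
raid1 then retried on sda2.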

The interesting part is that disk sdb, which according to the log above
caused the problems, is still marked as healthy in the array, and the
spare volumes are sitting idle as well. So now I don't know what to try or
not to try; I'm not clear on the causes or on what this can lead to...
Does anyone have experience with this and know what to expect?

The output of
cat /proc/mdstat
looks completely normal:
Personalities : [raid1]
md3 : active raid1 sdc2[0] sde2[2](S) sdd2[1]
      488287552 blocks [2/2] [UU]

md2 : active raid1 sdc1[0] sde1[2](S) sdd1[1]
      96256 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      488287552 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      96256 blocks [2/2] [UU]

unused devices: <none>

 and so does
mdadm --detail /dev/mdX
for all four arrays 0-3...
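
For checking this mechanically rather than by eye, a minimal sketch that
parses mdstat-style text and flags degraded arrays could look like the
following. The function name parse_mdstat and the dictionary layout are
made up for illustration, and only the line shapes seen in the listing
above are handled; on a live system the text would come from reading
/proc/mdstat instead of the embedded sample.

```python
import re

# Two of the arrays from the listing above, used as sample input.
MDSTAT = """\
Personalities : [raid1]
md3 : active raid1 sdc2[0] sde2[2](S) sdd2[1]
      488287552 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      488287552 blocks [2/2] [UU]

unused devices: <none>
"""


def parse_mdstat(text):
    """Map array name -> state, level, member lists and a degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\d+) : (\w+) (\w+) (.*)$", line)
        if head:
            name, state, level, devs = head.groups()
            # Each member looks like "sde2[2]" with an optional "(S)"/"(F)".
            members = re.findall(r"(\w+)\[\d+\](\(\w\))?", devs)
            current = arrays[name] = {
                "state": state,
                "level": level,
                "active": [d for d, flag in members if flag == ""],
                "spares": [d for d, flag in members if flag == "(S)"],
                "faulty": [d for d, flag in members if flag == "(F)"],
                "degraded": None,
            }
            continue
        # Status line, e.g. "[2/2] [UU]": configured/working devices.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current is not None:
            total, working, flags = status.groups()
            current["degraded"] = int(working) < int(total) or "_" in flags
    return arrays


arrays = parse_mdstat(MDSTAT)
for name in sorted(arrays):
    a = arrays[name]
    print(name, a["active"], a["spares"], a["faulty"], a["degraded"])
```

For the sample above no array comes out as degraded and sde2 is listed
only as a spare, which matches the "[2/2] [UU]" lines.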

S.M.A.R.T. does not report any problem either:
omega:~# smartctl -a /dev/sdb
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: ATA      WDC WD5000ABYS-0 Version: 1C01
Serial number:      WD-WCAPW1762778
Device type: disk
Local Time is: Thu Nov 15 16:03:32 2007 CET
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Error Counter logging not supported
Device does not support Self Test logging
omega:~#

What would you advise: how nervous should I be, and is there anything else
worth trying to diagnose?
The smartctl offline test passes without problems; the disks apparently
do not support any other tests...


