Disks and SMART

Petr Baláš petr at balas.cz
Saturday May 8 10:28:09 CEST 2010


Hello,

I am putting together a small server, and disks keep dropping out of its software RAID array.
The array is RAID5 over /dev/sda3, /dev/sdb3, /dev/sdc3, /dev/sdd3.
Twice in a row (within a few hours), disks sda and sdd dropped out.

SMART cheerfully claims the disks are OK:

localhost:~# smartctl -H /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

but the self-test results do not quite agree with that:

localhost:~# smartctl -l selftest /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%        61         1905068652
# 2  Conveyance offline  Completed: read failure       90%        61         1905068652
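
For orientation, the rows of the self-test log can be parsed and the first failing LBA converted to a byte offset on the disk. This is only a minimal Python sketch: the helper name `parse_selftest_rows` is hypothetical, the sample text is the log from this post with the mail-wrapped lines rejoined, and the 512-byte logical sector size is an assumption that the post does not state.

```python
import re

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors (not stated in the post)

# Self-test log rows from the post, with the mail-wrapped lines rejoined.
SAMPLE = """\
# 1  Extended offline    Completed: read failure       90%        61         1905068652
# 2  Conveyance offline  Completed: read failure       90%        61         1905068652
"""

# Columns are separated by runs of two or more spaces; the LBA column is "-"
# when a test finished without an error.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)


def parse_selftest_rows(text):
    """Return one dict per self-test row found in `text` (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # banner, header, or blank line
        lba = None if m.group("lba") == "-" else int(m.group("lba"))
        rows.append({
            "num": int(m.group("num")),
            "test": m.group("test").strip(),
            "status": m.group("status").strip(),
            "hours": int(m.group("hours")),
            "lba": lba,
            # Byte offset from the start of the whole disk, not of a partition.
            "byte_offset": None if lba is None else lba * SECTOR_SIZE,
        })
    return rows


if __name__ == "__main__":
    for row in parse_selftest_rows(SAMPLE):
        print(row["test"], "|", row["status"], "|", row["lba"], "|", row["byte_offset"])
```

Under the 512-byte assumption, LBA 1905068652 corresponds to byte offset 975395149824 from the start of the disk. Mapping that to a partition such as /dev/sda3 would additionally need the partition table, which is not shown here.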


What is your experience with how trustworthy the output of SMART is?

All the disks are Seagate ST31000528AS.
Debian, 64-bit, custom 2.6.33.3 kernel, disks set to AHCI in the BIOS.
I could accept one bad disk, but two at once surprises me a bit.


The errors on the Linux side looked like this:
ata1.00: qc timeout (cmd 0x2f)
ata1: failed to read log page 10h (errno=-5)
ata1.00: exception Emask 0x1 SAct 0x1ff SErr 0x0 action 0x6 frozen
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/00:00:0c:0c:8d/01:00:71:00:00/40 tag 0 ncq 131072 in
         res 40/00:20:0c:0b:8d/00:00:71:00:00/40 Emask 0x1 (device error)
ata1.00: status: { DRDY }
.....
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: qc timeout (cmd 0x2f)
ata1: failed to read log page 10h (errno=-5)
ata1.00: exception Emask 0x1 SAct 0x1ff SErr 0x0 action 0x6 frozen
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/00:00:0c:12:8d/01:00:71:00:00/40 tag 0 ncq 131072 in
         res 40/00:40:0c:0c:8d/00:00:71:00:00/40 Emask 0x1 (device error)
ata1.00: status: { DRDY }
.....
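
The `cmd` and `res` lines above carry the ATA register values, so the LBA of each failed read can be recovered from them. The sketch below assumes the field order printed by the kernel's libata error handler (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); the function name `decode_taskfile` is hypothetical. For NCQ commands such as READ FPDMA QUEUED the sector count travels in the feature registers.

```python
def decode_taskfile(dump):
    """Decode a libata 'cmd' or 'res' register dump (hypothetical helper).

    Assumed layout:
    command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    On a 'res' line the first field is the status register and the feature
    slot holds the error register.
    """
    command, low, high, device = dump.strip().split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24..47.
    lba = (
        hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
        | lbah << 16 | lbam << 8 | lbal
    )
    return {
        "command": int(command, 16),
        "lba": lba,
        # For NCQ reads and writes this 16-bit value is the sector count.
        "feature16": hob_feature << 8 | feature,
        "device": int(device, 16),
    }


if __name__ == "__main__":
    first = decode_taskfile("60/00:00:0c:0c:8d/01:00:71:00:00/40")
    print(hex(first["lba"]), first["lba"], first["feature16"])
```

For the first `cmd` line this gives LBA 0x718d0c0c (1905069068) and a count of 256 sectors, which at 512 bytes per sector matches the "ncq 131072 in" shown by the kernel. That start address is 416 sectors after the LBA_of_first_error from the self-test log (1905068652), which suggests the kernel errors and the self-test failures point at the same region of the disk.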


One more thing for completeness: this is the error log.

localhost:~# smartctl -l error /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 22 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 22 occurred at disk power-on lifetime: 51 hours (2 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00   1d+02:49:01.027  READ DMA EXT
  27 00 00 00 00 00 e0 00   1d+02:49:01.027  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+02:49:01.026  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   1d+02:49:01.006  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   1d+02:49:01.006  READ NATIVE MAX ADDRESS EXT

Error 21 occurred at disk power-on lifetime: 51 hours (2 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00   1d+02:48:57.880  READ DMA EXT
  27 00 00 00 00 00 e0 00   1d+02:48:57.880  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+02:48:57.879  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   1d+02:48:57.859  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   1d+02:48:57.859  READ NATIVE MAX ADDRESS EXT

Error 20 occurred at disk power-on lifetime: 51 hours (2 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00   1d+02:48:54.708  READ DMA EXT
  27 00 00 00 00 00 e0 00   1d+02:48:54.707  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+02:48:54.706  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   1d+02:48:54.687  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   1d+02:48:54.687  READ NATIVE MAX ADDRESS EXT

Error 19 occurred at disk power-on lifetime: 51 hours (2 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff ef 00   1d+02:48:51.554  READ DMA EXT
  27 00 00 00 00 00 e0 00   1d+02:48:51.552  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+02:48:51.551  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   1d+02:48:51.551  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   1d+02:48:51.531  READ NATIVE MAX ADDRESS EXT

Error 18 occurred at disk power-on lifetime: 51 hours (2 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 e0 ff ff ff ef 00   1d+02:48:48.406  READ DMA EXT
  27 00 00 00 00 00 e0 00   1d+02:48:48.405  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00   1d+02:48:48.404  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   1d+02:48:48.384  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00   1d+02:48:48.384  READ NATIVE MAX ADDRESS EXT

-- 
Petr Baláš - petr at balas dot cz


