qnap now searches for defective harddisk blocks in the setup assistant

31.Jul.2014

when a harddisk goes bad, i thought the filesystem would mark the bad blocks as “bad” and use other blocks instead.

nevertheless, i had a case where one samsung disk developed 2 bad blocks that went undetected.

now imagine another harddisk fails completely and you want to pull a backup.

you won’t get all the data.

as i recommended in a forum post, the qnap should check the harddisks for bad blocks before the raid goes into operation.

qnap setup assistant searching for bad blocks on the harddisk

you can check your harddisk in a non-destructive way for bad blocks via:

badblocks -sv /dev/sdb;

Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test):   7.20% done, 18:33 elapsed
 16.29% done, 42:36 elapsed
done
Pass completed, 0 bad blocks found.
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done


badblocks -sv /dev/sda;

Pass completed, 0 bad blocks found.
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): 946183680one, 2:50:06 elapsed
946183700one, 2:51:57 elapsed

12:20 31.07.2014

badblocks -sv /dev/sdc;
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.


badblocks -sv /dev/sdd;
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.

badblocks -sv /dev/sde;

Pass completed, 2 bad blocks found.
badblocks: No such file or directory while trying to determine device size
badblocks: No such file or directory while trying to determine device size

the qnap nas should show a red light telling you that this harddisk has failed.
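when several disks are checked one after another, the summary lines are easy to lose in the progress output. the following is a minimal sketch, assuming the badblocks output was saved to a log file; the function name count_bad_blocks is made up for this example. it only adds up the numbers from the "Pass completed, N bad blocks found." lines shown above:

```shell
#!/bin/sh
# Hypothetical helper: sum the bad-block counts found in saved badblocks
# output (read from stdin). It looks for summary lines of the form
# "Pass completed, N bad blocks found." and ignores everything else.
count_bad_blocks() {
    awk '{
        if (match($0, /Pass completed, [0-9]+ bad blocks found/)) {
            # skip the 16 characters of "Pass completed, " to reach the number
            s = substr($0, RSTART + 16)
            total += s + 0
        }
    } END { print total + 0 }'
}
```

one possible use is to run `badblocks -v /dev/sdb > sdb.log 2>&1` and then `count_bad_blocks < sdb.log`; anything above 0 means the disk deserves a closer look.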

if you don’t want to throw away your harddisk (you should… usually it won’t take long until the whole harddisk dies), the following article shows how to keep using it:

credits: http://stephane.lesimple.fr/blog/2011-03-22/how-to-securely-keep-a-hard-drive-with-bad-blocks-in-a-raid-array.html#comments 

How to securely keep a hard drive with bad blocks in a raid array

I’m using a custom-made NAS at home, with two hard drives of 1 TB.
Some of the partitions of these 2 disks are organized in a RAID-1 setup using mdraid (and the mdadm user-space tool).

This morning, I had some scary lines in the dmesg output. Any sysadmin sighs when he sees those (and cries if he doesn’t have backups, but all sysadmins have backups, right?)

root# dmesg
ata1.00: exception Emask 0x0 SAct 0x1ff SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000008
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/80:40:56:2b:a9/00:00:73:00:00/40 tag 8 ncq 65536 in
res 41/40:00:90:2b:a9/00:00:73:00:00/40 Emask 0x409 (media error) <F>
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] Unhandled sense code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
73 a9 2b 90
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
sd 0:0:0:0: [sda] CDB: Read(10): 28 00 73 a9 2b 56 00 00 80 00
end_request: I/O error, dev sda, sector 1940466576
ata1: EH complete

The “auto reallocate failed” part especially sucks.
A quick look at smartctl on the faulty underlying drive of the raid1 showed a not-nice value of 13 for Offline_Uncorrectable.
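For reference, here is a small sketch of how such a value could be pulled out of the SMART attribute table. It assumes the usual ten-column layout printed by `smartctl -A` (raw value in the last column); the function name smart_raw_value is hypothetical and not part of the original article:

```shell
#!/bin/sh
# Hypothetical helper: print the raw value of one SMART attribute.
# Reads the attribute table (as printed by "smartctl -A") from stdin.
# $1 = attribute name, e.g. Offline_Uncorrectable
# Assumes the common layout where the name is column 2 and the raw
# value is column 10.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

It could be used as `smartctl -A /dev/sda | smart_raw_value Offline_Uncorrectable`, which would print 13 for the drive described here.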

Any sysadmin in any company would then just replace the faulty drive and happily wait for the raid to resync. But when it’s your home drive, and we’re talking about 13 faulty blocks out of several zillion, it suddenly seems a bit stupid to throw away the hard drive when 99.9999993% of the blocks are perfectly okay (yes, this is the actual ratio).
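The ratio can be checked with a quick calculation. This is only a sketch: the total of 1953525168 sectors (512 bytes each) is an assumption for a typical 1 TB drive, not a figure from the article, and the function name healthy_percent is made up:

```shell
#!/bin/sh
# Hypothetical helper: percentage of sectors that are still healthy.
# $1 = number of bad sectors, $2 = total number of sectors on the drive
healthy_percent() {
    awk -v bad="$1" -v total="$2" \
        'BEGIN { printf "%.7f\n", (1 - bad / total) * 100 }'
}
```

With the assumed sector count, `healthy_percent 13 1953525168` prints 99.9999993, which matches the number quoted above.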

I’m using EXT4 for this raid partition, so I wanted to take advantage of the badblocks mechanism of this filesystem, as mdraid doesn’t (yet?) have any such mechanism.

I ran a read of the entire partition to locate where the problems were, and a grep in dmesg determined that the bad blocks were between the logical sectors 1015773 and 1015818 of the partition (which uses a blocksize of 4K, as reported by tune2fs -l).
So I took a safety margin and decided to blacklist the logical sectors from 1015700 to 1015900.
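The article does not show how the sector numbers from dmesg were turned into block numbers inside the partition. A common way to do it, sketched here under the assumption of 512-byte device sectors, is to subtract the first sector of the partition and rescale to the filesystem block size. The function name sector_to_fs_block is hypothetical, and the partition start has to be looked up separately (for example with fdisk -l); it is not given in the source:

```shell
#!/bin/sh
# Hypothetical helper: convert a device sector number (as reported by the
# kernel) into a filesystem block number inside a partition.
# $1 = failing sector on the whole device
# $2 = first sector of the partition
# $3 = filesystem block size in bytes (e.g. 4096, see tune2fs -l)
# Assumes 512-byte device sectors.
sector_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}
```

For example, with a made-up partition starting at sector 2048 and 4K blocks, `sector_to_fs_block 10240 2048 4096` gives block 1024.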

I first made a list of all the inodes impacted, using debugfs:

root# seq 1015700 1015900 | sed -re 's/^/icheck /' | debugfs /dev/sda3 2>/dev/null | awk '/^[0-9]+[[:space:]]+[0-9]+$/ {print $2}' | tee badinodes
125022
125022
125022
[… snip …]

Then searched the file names attached to those inodes:

root# sort -u badinodes | sed -re 's/^/ncheck /' | debugfs /dev/sda3 2>/dev/null | awk '/^[0-9]/ { $1=""; print }' | tee badfiles
/usr/src/linux-headers-2.6.35-28/arch/microblaze/include
/usr/src/linux-headers-2.6.35-28/arch/microblaze/include/asm
/var/lib/mlocate/mlocate.db
[… snip …]

In my case, it was only non-critical system files, but if more important files are impacted, doing a backup from the good partition would probably be a good idea, just in case…

I rebooted on a live system to be able to work with my root filesystem, and started the array with only the good drive in it.

root@PartedMagic # mdadm --assemble /dev/md0 /dev/sdb3
mdadm: /dev/md0 has been started with 1 drive (out of 2).

And used fsck to manually add a list of the badblocks.

root@PartedMagic # seq 1015700 1015900 > badblocks
root@PartedMagic # fsck.ext4 -C 0 -l badblocks -y /dev/md0
e2fsck 1.41.11 (14-Mar-2010)
slash: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Running additional passes to resolve blocks claimed by more than one inode…
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 125022: 1015700 1015701 1015702 1015703 1015704 1015705 1015706 1015707 1015708 1015709 1015710 1015711 1015712 1015713 1015714 1015715 1015716 1015717 1015718 1015719 1015720 1015721 1015722 1015723 1015724 1015725 1015726 1015727 1015728 1015729 1015730 1015731 1015732 1015733 1015734 1015735 1015736 1015737 1015738 1015739 1015740 1015741 1015742 1015743
Multiply-claimed block(s) in inode 179315: 1015744
Multiply-claimed block(s) in inode 179316: 1015745
Multiply-claimed block(s) in inode 179317: 1015746
Multiply-claimed block(s) in inode 179318: 1015747
Multiply-claimed block(s) in inode 179319: 1015748
Multiply-claimed block(s) in inode 179320: 1015749
[… snip …]
Multiply-claimed block(s) in inode 179376: 1015805
Multiply-claimed block(s) in inode 179377: 1015806
Multiply-claimed block(s) in inode 179378: 1015807
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 65 inodes containing multiply-claimed blocks.)
File /lib/modules/2.6.35-28-generic-pae/kernel/net/sunrpc/sunrpc.ko (inode #125022, mod time Tue Mar 1 15:57:40 2011)
has 44 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/cris/arch-v10/lib (inode #179315, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/cris/boot (inode #179316, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/cris/boot/compressed (inode #179317, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
[… snip …]
File /usr/src/linux-headers-2.6.35-28/arch/microblaze/lib (inode #179376, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/microblaze/include (inode #179377, mod time Sat Mar 19 05:32:02 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
File /usr/src/linux-headers-2.6.35-28/arch/microblaze/include/asm (inode #179378, mod time Sat Mar 19 05:32:10 2011)
has 1 multiply-claimed block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Tue Mar 22 20:36:03 2011)
Clone multiply-claimed blocks? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (7663, counted=7555).
Fix? yes
Free blocks count wrong for group #30 (387, counted=495).
Fix? yes

slash: ***** FILE SYSTEM WAS MODIFIED *****
slash: 261146/472352 files (0.2% non-contiguous), 1733488/1886240 blocks

Look at how fsck detects the inodes that are claiming the same blocks. This is totally normal, as some of the badblocks are associated with files (as we found above), and hence are referenced in the badblocks inode AND in the real files’ inodes. The fsck fix is exactly what we need: it just duplicates the data block so that each inode has its own data block.

… wait, did it modify the files’ inodes or the badblocks inode?

root@PartedMagic # dumpe2fs -b /dev/md0
dumpe2fs 1.41.11 (14-Mar-2010)
1015700
1015701
[… snip …]
1015899
1015900

Alright, fsck did exactly the right thing: it modified the real files’ inodes by copying the data from the blocks in the badblocks list to new unallocated blocks. As this was done with the working disk of the raid array, the impacted files have not lost their integrity.

Just for fun, let’s check that the blocks we wanted to ban data from are indeed no longer used by any real inode:

root@PartedMagic # seq 1015700 1015900 | sed -re 's/^/icheck /' | debugfs /dev/md0 2>/dev/null
debugfs: Block Inode number
1015700 <block not found>
debugfs: Block Inode number
1015701 <block not found>
debugfs: Block Inode number
1015702 <block not found>
debugfs: Block Inode number
1015703 <block not found>
debugfs: Block Inode number
1015704 <block not found>
debugfs: Block Inode number
1015705 <block not found>
debugfs: Block Inode number
1015706 <block not found>
debugfs: Block Inode number
1015707 <block not found>
[… snip …]

Okay!

We now just have to resync the array over the bad disk:

root@PartedMagic # mdadm --re-add /dev/md0 /dev/sda3
mdadm: re-added /dev/sda3

Wait for it to finish… (check /proc/mdstat to see the progress), then reboot! :)
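To avoid reading the whole of /proc/mdstat each time, the progress figure can be filtered out. This is a minimal sketch, assuming the usual "recovery = 12.6%" or "resync = 12.6%" wording in /proc/mdstat; the function name resync_progress is made up:

```shell
#!/bin/sh
# Hypothetical helper: print the rebuild progress of md arrays.
# Reads the contents of /proc/mdstat from stdin and prints only the
# "recovery = NN.N%" or "resync = NN.N%" part of each progress line.
resync_progress() {
    grep -oE '(recovery|resync) *= *[0-9.]+%'
}
```

It would be called as `resync_progress < /proc/mdstat`; when no array is rebuilding it prints nothing.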

 

 

another interesting article out of the google cache:

source: http://webcache.googleusercontent.com/search?q=cache:oCekGqhL9wMJ:www.sj-vs.net/forcing-a-hard-disk-to-reallocate-bad-sectors/+&cd=1&hl=en&ct=clnk

Forcing a hard disk to reallocate bad sectors

 

Sometimes a hard disk hints at an upcoming failure. Some disks start to make unexpected sounds, others are silent and only cause some noise in your syslog. In most cases the disk will automatically reallocate one or two damaged sectors and you should start planning on buying a new disk while your data is safe. However, sometimes the disk won’t automatically reallocate these sectors and you’ll have to do that manually yourself. Luckily, this doesn’t involve any rocket science.

A few days ago, one of my disks reported some problems in my syslog while rebuilding a RAID5-array:

Jan 29 18:19:54 dragon kernel: [66774.973049] end_request: I/O error, dev sdb, sector 1261069669
Jan 29 18:19:54 dragon kernel: [66774.973054] raid5:md3: read error not correctable (sector 405431640 on sdb6).
Jan 29 18:19:54 dragon kernel: [66774.973059] raid5: Disk failure on sdb6, disabling device.

Jan 29 18:20:11 dragon kernel: [66792.180513] sd 3:0:0:0: [sdb] Unhandled sense code
Jan 29 18:20:11 dragon kernel: [66792.180516] sd 3:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 29 18:20:11 dragon kernel: [66792.180521] sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Jan 29 18:20:11 dragon kernel: [66792.180547] sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Jan 29 18:20:11 dragon kernel: [66792.180553] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 4b 2a 6c 4c 00 00 c0 00
Jan 29 18:20:11 dragon kernel: [66792.180564] end_request: I/O error, dev sdb, sector 1261071601

Modern hard disk drives are equipped with a small number of spare sectors to reallocate damaged sectors. However, a sector only gets reallocated when a write operation fails. A failing read operation will, in most cases, only throw an I/O error. In the unlikely event a second read does succeed, some disks perform an auto-reallocation and data is preserved. In my case, the second read failed miserably (“Unrecovered read error - auto reallocate failed”).

The read errors were caused by a sync of a new RAID5 array, which was initially running in degraded mode (on /dev/sdb and /dev/sdc, with /dev/sdd missing). Obviously, mdadm kicked sdb out of the already degraded RAID5-array, leaving nothing but sdc. That’s not something to be very happy about…
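The failing sector numbers can be collected from the kernel log lines shown above instead of being copied by hand. This is a sketch only, assuming the "I/O error, dev sdb, sector N" wording from the log excerpt; the function name failed_sectors is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list the distinct sector numbers for which the
# kernel logged an I/O error on one device.
# Reads syslog or dmesg text from stdin.
# $1 = device name as it appears in the log, e.g. sdb
failed_sectors() {
    sed -n "s|.*I/O error, dev $1, sector \([0-9][0-9]*\).*|\1|p" | sort -un
}
```

For the log above, `dmesg | failed_sectors sdb` would list 1261069669 and 1261071601, one per line.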

The only solution to this problem was to force sdb to reallocate the damaged sectors. That way, mdadm wouldn’t encounter the read errors and the initial sync of the array would succeed. A tool like hdparm can help you force a disk to reallocate a sector by simply issuing a write command to the damaged sector. First, check the number of reallocated sectors on the disk:

$ smartctl -a /dev/sdb | grep -i reallocated

5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0

The zeroes at the end of the lines indicate that there are no reallocated sectors on /dev/sdb. Let’s check whether sector 1261069669 is really damaged:

$ hdparm --read-sector 1261069669 /dev/sdb

/dev/sdb: Input/Output error

Now, issue the write command (note that hdparm will completely bypass regular block layer read/write mechanisms) to the damaged sector(s). Note that the data on these sectors will be lost forever!

$ hdparm --write-sector 1261069669 /dev/sdb

/dev/sdb:
Use of --write-sector is VERY DANGEROUS.
You are trying to deliberately overwrite a low-level sector on the media
This is a BAD idea, and can easily result in total data loss.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.

Program aborted.

$ hdparm --write-sector 1261069669 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 1261069669: succeeded

$ hdparm --write-sector 1261071601 --yes-i-know-what-i-am-doing /dev/sdb

/dev/sdb: re-writing sector 1261071601: succeeded

Now, use hdparm again to check the availability of the reallocated sectors:

$ hdparm --read-sector 1261069669 /dev/sdb

/dev/sdb:
reading sector 1261069669: succeeded
(a lot of zeroes should follow)

And using SMART we can check whether the disk has registered two reallocated sectors:

$ smartctl -a /dev/sdb | grep -i reallocated

5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       2
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       2

It’s actually quite simple to force mdadm to continue using sdb as if nothing ever happened:

$ mdadm --assemble --force /dev/md3 /dev/sdb6 /dev/sdc6

(mdadm will complain about being forced to increase the event counter of sdb6)

$ mdadm /dev/md3 --add /dev/sdd6

And a few minutes later, the array is as good as new!

Comments on “Forcing a hard disk to reallocate bad sectors”

  1. Mateusz Korniak 
    Hi, great tutorial! First, I think it’s worth mentioning that it is possible to kick only the affected partition from the array by comparing the LBA of the sector with the output of sfdisk -d /dev/sdd. Second, I wonder if a full sync should be forced before adding the partition back to the array?
    So I would execute:

    mdadm --zero-superblock /dev/sdd6

    before:

    mdadm /dev/md3 --add /dev/sdd6

    We do not want a partially zeroed /dev/sdd6 to become part of the working array without a full resync (due to usage of the bitmap).

    Regards,

  2. Carlos D. Garza 
    Great article. It did a great job of demystifying bad sector relocation for me. My issue wasn’t raid related so I was a little more desperate to get the drive into a bootable state just long enough to move the important data off of it. It worked so thanks again.
  3. Reik Red 
    Great trick to use hdparm to write the sector. I have a case where using dd to overwrite the sector produces an IO error and no reallocation. But using hdparm works and forces the reallocation. Amazing.