I just rebooted my monitoring server for the first time in a while, and the following started filling the screen:
Jul 11 23:52:30 monit kernel: [ 25.255908] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 11 23:52:30 monit kernel: [ 25.256170] ata1.00: BMDMA stat 0x24
Jul 11 23:52:30 monit kernel: [ 25.256278] ata1.00: failed command: READ DMA
Jul 11 23:52:30 monit kernel: [ 25.256410] ata1.00: cmd c8/00:c0:20:68:35/00:00:00:00:00/e0 tag 0 dma 98304 in
Jul 11 23:52:30 monit kernel: [ 25.256416] res 51/40:9f:41:68:35/00:00:00:00:00/e0 Emask 0x9 (media error)
Jul 11 23:52:30 monit kernel: [ 25.256809] ata1.00: status: { DRDY ERR }
Jul 11 23:52:30 monit kernel: [ 25.256933] ata1.00: error: { UNC }
Jul 11 23:52:30 monit kernel: [ 25.304388] ata1.00: configured for UDMA/66
Jul 11 23:52:30 monit kernel: [ 25.304430] ata1: EH complete
. . .
Jul 11 23:52:30 monit kernel: [ 25.552451] sd 0:0:0:0: [sda] Unhandled sense code
Jul 11 23:52:30 monit kernel: [ 25.552462] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 11 23:52:30 monit kernel: [ 25.552475] sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Jul 11 23:52:30 monit kernel: [ 25.552490] Descriptor sense data with sense descriptors (in hex):
Jul 11 23:52:30 monit kernel: [ 25.552498] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jul 11 23:52:30 monit kernel: [ 25.552529] 00 35 68 41
Jul 11 23:52:30 monit kernel: [ 25.552543] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Jul 11 23:52:30 monit kernel: [ 25.552559] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 35 68 20 00 00 c0 00
Jul 11 23:52:30 monit kernel: [ 25.552587] end_request: I/O error, dev sda, sector 3500097
Jul 11 23:52:30 monit kernel: [ 25.556607] ata1: EH complete
I already know I need to replace the HDD (Cost of Data > Cost of HDD), but I want to know for my own knowledge what’s actually wrong with it.
Yes, our monitoring server has no RAID, just one HDD… Don’t look at me…
Randall
asked Jul 12, 2012 at 5:07
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Looks like the drive has bad sectors and is unable to reallocate these (possibly because it's run out of spare sectors). The output of smartctl -a /dev/sda would give you more information on the state of the drive.
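As a minimal sketch (the device name is an example; adapt it to your disk), the health summary and the reallocation-related attributes are the first things to check:
# smartctl -H /dev/sda
# smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable'
Non-zero raw values for Current_Pending_Sector or Offline_Uncorrectable, or a Reallocated_Sector_Ct value near its threshold, would support the "run out of spare sectors" theory.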
answered Jul 12, 2012 at 5:12
mgorven
Lassie's saying "arf! arf arf! arf!". Which is dumb, because this has nothing to do with Timmy or wells. This is why you don't take sysadmin advice from dogs.
The drive is giving you an "Unrecovered read error - auto reallocate failed", which basically means "I tried to read, I failed, I tried to recover (read the sector a few more times, apply some ECC, and move the data to a sector that isn't broken), and it didn't work". This probably means (as mgorven says) that the disk is chock full of reallocated sectors already, because the disk's been dying for a while, but I also think it can mean that it wasn't able to recover the sector at all (repeated reads + ECC failed to get a good-looking data block).
Either way, yeah, the drive’s very, very cactus. Your data isn’t looking real healthy, either.
answered Jul 12, 2012 at 5:15
womble♦
I know this is old, but just in case someone is still reading this post: "dd will also try to read the broken sector(s)" - gddrescue is useful here: it doesn't keep retrying the broken sectors (okay, it reads them, but only once by default).
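As a sketch of using it (the gddrescue package provides the ddrescue command on Debian/Ubuntu; devices and paths here are examples), the usual two-pass approach grabs everything that reads cleanly first and only then retries the bad areas, keeping a map file so the run can be resumed:
# pass 1: copy the easy data, skip over bad sectors
ddrescue -n /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map
# pass 2: go back and retry the remaining bad areas a few times
ddrescue -r3 /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map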
answered Apr 10, 2014 at 19:08
Make a dd image or rsync copy of that disk now++, unless you have a full backup allowing a convenient restore of that box. And start looking for a compatible and working replacement disk.
BTW, UDMA/66, is that a ten year old PATA disk?
answered Jul 12, 2012 at 7:25
As already mentioned, it likely means your drive is nearing its end of life, but not necessarily immediately — you should run an fsck on the disk and try to repair the errors (see the smartmontools wiki for advice on fixing bad blocks), and the disk may be OK for a while longer. But you should start running smartd (which comes as part of the smartmontools package) and keep an eye on its reports and/or set up email notifications. You can also add custom notifications of your own by creating scripts (in /etc/smartmontools/run.d/) that are called by smartd-runner.
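As a minimal sketch of that setup, assuming a Debian/Ubuntu-style system (the device, test schedule, and mail recipient are examples to adapt, and the service may be called smartmontools rather than smartd on some releases):
# /etc/smartd.conf: monitor all attributes, enable automatic offline testing and
# attribute autosave, run a short self-test nightly at 02:00 and a long one on
# Saturdays at 03:00, and mail root on any trouble
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
# enable and start the daemon
sudo systemctl enable --now smartd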
answered Oct 25, 2017 at 19:44
Pierz
I have a zpool (3x 3TB Western Digital Red) that I scrub weekly for errors; the scrubs come up OK, but I have a recurring error in my syslog:
Jul 23 14:00:41 server kernel: [1199443.374677] ata2.00: exception Emask 0x0 SAct 0xe000000 SErr 0x0 action 0x0
Jul 23 14:00:41 server kernel: [1199443.374738] ata2.00: irq_stat 0x40000008
Jul 23 14:00:41 server kernel: [1199443.374773] ata2.00: failed command: READ FPDMA QUEUED
Jul 23 14:00:41 server kernel: [1199443.374820] ata2.00: cmd 60/02:c8:26:fc:43/00:00:f9:00:00/40 tag 25 ncq 1024 in
Jul 23 14:00:41 server kernel: [1199443.374820] res 41/40:00:26:fc:43/00:00:f9:00:00/40 Emask 0x409 (media error) <F>
Jul 23 14:00:41 server kernel: [1199443.374946] ata2.00: status: { DRDY ERR }
Jul 23 14:00:41 server kernel: [1199443.374979] ata2.00: error: { UNC }
Jul 23 14:00:41 server kernel: [1199443.376100] ata2.00: configured for UDMA/133
Jul 23 14:00:41 server kernel: [1199443.376112] sd 1:0:0:0: [sda] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 23 14:00:41 server kernel: [1199443.376115] sd 1:0:0:0: [sda] tag#25 Sense Key : Medium Error [current] [descriptor]
Jul 23 14:00:41 server kernel: [1199443.376118] sd 1:0:0:0: [sda] tag#25 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 23 14:00:41 server kernel: [1199443.376121] sd 1:0:0:0: [sda] tag#25 CDB: Read(16) 88 00 00 00 00 00 f9 43 fc 26 00 00 00 02 00 00
Jul 23 14:00:41 server kernel: [1199443.376123] blk_update_request: I/O error, dev sda, sector 4181982246
Jul 23 14:00:41 server kernel: [1199443.376194] ata2: EH complete
A while back I had a faulty SATA cable that caused some read/write errors (which were later corrected by zpool scrubs and restoring from snapshots), and I originally thought this error was a result of that. However, it keeps randomly recurring, this time while I was in the middle of a scrub.
So far ZFS says that there are no errors, but it also says it's "repairing" that disk:
pool: sdb
state: ONLINE
scan: scrub in progress since Sun Jul 23 00:00:01 2017
5.41T scanned out of 7.02T at 98.9M/s, 4h44m to go
16.5K repaired, 77.06% done
config:
NAME STATE READ WRITE CKSUM
sdb ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N1366685 ONLINE 0 0 0 (repairing)
ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0K3PFPS ONLINE 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0M94AKN ONLINE 0 0 0
cache
sde ONLINE 0 0 0
errors: No known data errors
SMART data seems to tell me that everything is OK after running a short test; I'm in the middle of running the long self-test now to see if that comes up with anything. The only thing that jumps out is the UDMA_CRC_Error_Count, but since I fixed that SATA cable it hasn't increased at all.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 195 175 021 Pre-fail Always - 5233
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 625
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 069 069 000 Old_age Always - 22931
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 625
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 581
193 Load_Cycle_Count 0x0032 106 106 000 Old_age Always - 283773
194 Temperature_Celsius 0x0022 118 109 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 133 000 Old_age Always - 1801
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 22931 -
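For reference, a minimal sketch of starting and later reviewing the extended self-test mentioned above (the device name is an example):
# smartctl -t long /dev/sda      # kicks off the extended test in the drive's firmware
# smartctl -l selftest /dev/sda  # shows the result, including LBA_of_first_error, once it finishes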
In addition to that, I’m also getting notifications about ZFS I/O errors, even though according to this it’s just a bug related to drive idling/spin up time.
eid: 71
class: io
host: server
time: 2017-07-23 15:57:49-0500
vtype: disk
vpath: /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N1366685-part1
vguid: 0x979A2C1464C41735
cksum: 0
read: 0
write: 0
pool: sdb
My main question is how concerned should I be about that drive? I'm inclined to replace it to be safe, but I wanted to know how soon I need to.
Here are the possibilities I'm thinking might explain the discrepancy between the SMART data and ZFS/kernel:
- The ZFS I/O error bug makes the kernel think there are bad sectors, but according to SMART there aren't any.
- ZFS keeps repairing that drive (related to the previous errors with the faulty cable), which might also point to drive failure, despite the SMART data.
- The error is a false alarm and related to this unfixed bug in Ubuntu
EDIT: Now I just realized that the good drives are on firmware version 82.00A82, while the one that’s getting the errors is 80.00A80. According to the Western Digital forum, there’s no way to update this particular model’s firmware. I’m sure that’s not helping either.
EDIT 2: I forgot to update this a long time ago, but this did end up being a hardware issue. After swapping multiple SATA cables, I finally realized that the issue the whole time was a failing power cable. The power flakiness was killing the drive, but I managed to get better drives and save the pool.
My NTFS partition has gotten corrupted somehow (it's a relic from the days when I had Windows installed).
I’m putting the debug output of fdisk and blkid here.
At the same time, any OS is unable to mount my root partition, which is located next to my NTFS partition. I’m not sure if this has anything to do with it, though. I get the following error while trying to mount my root partition (sda5)
mount: wrong fs type, bad option, bad superblock on /dev/sda5,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
ubuntu@ubuntu:~$ dmesg | tail
[ 1019.726530] Descriptor sense data with sense descriptors (in hex):
[ 1019.726533] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 1019.726551] 1a 3e ed 92
[ 1019.726558] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
[ 1019.726568] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 1a 3e ed 40 00 01 00 00
[ 1019.726584] end_request: I/O error, dev sda, sector 440331666
[ 1019.726602] JBD: Failed to read block at offset 462
[ 1019.726609] ata1: EH complete
[ 1019.726612] JBD: recovery failed
[ 1019.726617] EXT4-fs (sda5): error loading journal
When I open gparted (using live CD), I get an exclamation next to my NTFS drive which states
Is there a way to run chkdsk without using Windows? My attempt to run fsck results in the following:
ubuntu@ubuntu:~$ sudo fsck /dev/sda
fsck from util-linux-ng 2.17.2
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sda
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
Update: I was able to fix the NTFS partition running chkdsk off Hiren’s BootCD, but it seems that the superblock problem still remains.
Update 2: Fixed superblock issue using e2fsck -c /dev/sda5
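For the "run chkdsk without Windows" part, a small sketch using ntfsfix from the ntfs-3g package (the partition name here is an assumption; use your actual NTFS partition). Note that ntfsfix only repairs a few common inconsistencies and otherwise clears the dirty flag and schedules a full chkdsk for the next Windows boot:
sudo ntfsfix /dev/sda2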
Eh, I'll go a bit off-topic here too.
The sector problem went away; I installed the OS onto the partition created above, rebooted, and everything was fine.
But after a while the OS started grinding to a halt; the drive apparently dropped out on the fly (I couldn't gather a proper history, since not a single utility would start, and the terminal, whose working set was still living in RAM, just blinked its cursor helplessly while everything periodically froze). After rebooting with the reset button, a familiar picture:
The BIOS no longer found the disk.
Hooked it back up to the other desktop and everything was fine, except that a few lost inodes were automatically recovered, judging by dmesg:
Code:
[ 69.116972] EXT4-fs (sda1): 273 orphan inodes deleted
[ 69.116976] EXT4-fs (sda1): recovery complete
[ 69.314627] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Filled the whole thing with zeros again; 372 GB went through without a problem.
Code:
chocobo@desktop ~ $ ls -l /mnt/testfile
-rw-r--r-- 1 root root 390960017408 Jan 20 20:08 /mnt/testfile
chocobo@desktop ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sda1 367G 367G 0 100% /mnt
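Presumably the fill itself was done with something along these lines (a sketch, assuming the filesystem is mounted at /mnt; dd simply writes zeros until the filesystem is full and then stops with a "No space left on device" error):
dd if=/dev/zero of=/mnt/testfile bs=1M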
Apparently it's time to say goodbye to the PSU. A compound problem is a treacherous thing: the drive did have unrecoverable sectors, but on top of that the power supply is acting up. For now I've plugged in a 2.5″ 5400 rpm laptop drive, which needs less power, and it's been running for several hours now, the rascal.
For a long time no interesting little problems had come up that would be worth covering in a note. Yesterday one finally turned up that, from my point of view, is interesting.
As the starting point we have a RAID1 array built on mdadm, in which both disks have read errors in several sectors. The system reports "Unrecovered read error - auto reallocate failed" for both of them:
Aug 21 21:53:11 one kernel: [112350.663076] sd 2:0:0:0: [sda] Unhandled sense code
Aug 21 21:53:11 one kernel: [112350.663081] sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 21 21:53:11 one kernel: [112350.663089] sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Aug 21 21:53:11 one kernel: [112350.663133] sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Aug 21 21:53:11 one kernel: [112350.663144] sd 2:0:0:0: [sda] CDB: Read(10): 28 00 3a 38 53 b0 00 00 08 00
Aug 21 21:53:11 one kernel: [112350.663160] end_request: I/O error, dev sda, sector 976769972
Attempts to complete a long S.M.A.R.T. test do not succeed. Excerpts from S.M.A.R.T.:
# smartctl -a /dev/sda
...
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure      90%       32520         976769972
...
# smartctl -a /dev/sdb
...
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure      90%       32519         55084971
It thus becomes clear that as long as both hard disks are present, the array keeps working more or less correctly, rerouting failed read attempts from one disk to the other. That shows up as messages like these:
Aug 22 09:12:18 one kernel: [153097.501298] end_request: I/O error, dev sdb, sector 750558140
Aug 22 09:12:18 one kernel: [153097.508526] raid1: sdb4: rescheduling sector 695471160
Aug 22 09:12:18 one kernel: [153097.515736] raid1: sdb4: rescheduling sector 695471408
Aug 22 09:12:30 one kernel: [153109.603257] sd 4:0:0:0: [sdb] Unhandled sense code
Aug 22 09:12:30 one kernel: [153109.603259] sd 4:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 22 09:12:30 one kernel: [153109.603262] sd 4:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Aug 22 09:12:30 one kernel: [153109.603277] sd 4:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Aug 22 09:12:30 one kernel: [153109.603281] sd 4:0:0:0: [sdb] CDB: Read(10): 28 00 2c bc 9b b5 00 00 08 00
Aug 22 09:12:30 one kernel: [153109.603287] end_request: I/O error, dev sdb, sector 750558140
Aug 22 09:12:30 one kernel: [153109.610682] raid1:md3: read error corrected (8 sectors at 695471248 on sdb4)
Aug 22 09:12:30 one kernel: [153110.033913] raid1: sda4: redirecting sector 695471160 to another mirror
Aug 22 09:12:31 one kernel: [153110.177597] raid1: sda4: redirecting sector 695471408 to another mirror
If one of the disks fails completely, the array will never be able to rebuild, because reading from either disk ends in failure.
# dd if=/dev/sda of=/dev/null bs=1M
dd: reading `/dev/sda': Input/output error
476938+1 records in
476938+1 records out
500106223616 bytes (500 GB) copied, 5724.82 s, 87.4 MB/s
Of course, one option is to make a copy of the system with an ordinary tar archive, then replace the hard disks and redeploy the previously saved system. Another option is to replace one hard disk, create a RAID1 array on it with one disk initially missing, and then try to sync the system over from the remaining disk with tar or rsync (a sketch of this follows below).
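A minimal sketch of that second option, assuming the new disk appears as /dev/sdc with a single partition (device names, filesystem, and rsync excludes are illustrative; bootloader, mdadm.conf, and initramfs steps are omitted):
# create a one-disk ("degraded") RAID1 on the new drive
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 missing
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt/new
# copy the running system over from the old, still-readable disk
rsync -aAXH --exclude=/proc --exclude=/sys --exclude=/dev --exclude=/run --exclude=/mnt / /mnt/new/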
However, it was noticed that the unreadable sectors on the two disks are located in different places. Hence the idea: force a remap of the damaged sectors on one of the disks and then restore the data in them from the other half-dead disk. As a result of this operation we get one disk holding a complete copy of our data, from which everything can be read without errors. All that remains is to replace the second half-dead disk with a new one; mdadm will automatically read the data and sync it to the new disk. If desired, once the sync is done the disk on which the sectors were remapped can be replaced as well.
To force a hard disk to remap a sector, you have to write to that sector. The minimal math involved is covered in the article Forcing a hard disk to reallocate bad sectors. In our case it was determined empirically that sectors 976769972 through 976769979 are bad on disk sda:
# hdparm --read-sector 976769972 /dev/sda
/dev/sda:
reading sector 976769972: FAILED: Input/output error
...
# hdparm --read-sector 976769979 /dev/sda
/dev/sda:
reading sector 976769979: FAILED: Input/output error
#
So we perform a dummy write to that range of sectors:
for sector in $(seq 976769972 976769979); do hdparm --write-sector $sector --yes-i-know-what-i-am-doing /dev/sda; done
After that we can verify that those sectors are readable again (the previous command zeroed them out completely):
for sector in $(seq 976769972 976769979); do hdparm --read-sector $sector /dev/sda; done
Now we need to write the data from the neighbouring disk in the mirror into those areas. Since we know the offset (976769972) and that the number of damaged sectors is 8, we copy that area as a single block. Note that we read the area from disk sdb and write it to disk sda:
# dd if=/dev/sdb of=copy skip=976769972 count=8
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 9.3381e-05 s, 43.9 MB/s
# dd if=copy of=/dev/sda seek=976769972 oflag=direct count=8
8+0 records in
8+0 records out
4096 bytes (4.1 kB) copied, 0.000874658 s, 4.7 MB/s
Zeroing via hdparm is not mandatory. You can write the required data directly into the sectors with dd:
for sector in $(seq 976769972 976769979); do dd if=/dev/sdb of=/dev/sda skip=$sector seek=$sector oflag=direct count=1; done
All that remains is to make sure that all the data on the disk can be read:
dd if=/dev/sda of=/dev/null bs=1M
476940+1 records in
476940+1 records out
500107862016 bytes (500 GB) copied, 5622.88 s, 88.9 MB/s
So, after the manipulations described above, we have a conditionally working hard disk sda, which we leave in the system while we replace disk sdb. When the sync to sdb has finished, we swap out sda as well. Once that sync completes, we have a fully healthy array with intact data.
In closing, it is worth noting that when hunting for bad sectors, this command can be indispensable:
dd if=/dev/sda of=/dev/null bs=512 conv=sync,noerror
If you wish, you can redirect the binary copy of the system to a separate medium; in that case the sync conversion flag is simply indispensable. On read errors, the failed sectors in the copy are filled with NULs (zeros), so the resulting copy matches the original exactly in size.
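For example, a sketch of saving such a copy to an external drive mounted at /mnt/usb (the path is an example):
dd if=/dev/sda of=/mnt/usb/sda.img bs=512 conv=sync,noerror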
A second indispensable helper is the badblocks command. It is important to remember to specify a block size of 512 bytes when invoking it; by default the program uses a value of 1024. The invocation therefore looks like this:
badblocks -v -b 512 /dev/sda
The program will read sector after sector and report the sectors it could not read. For example:
# badblocks -v -b 512 /dev/sda
Checking blocks 0 to 1565565871
Checking for bad blocks (read-only test):
578915880 578915881 578915882 578915883 578915884 578915885 578915886 578915887
578915888 578915889 578915890 578915891 578915892 578915893 578915894 578915895
In that case, all that remains is to save the program's output to a file and rewrite the listed sectors one by one from the other disk (a sketch follows).
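A minimal sketch of that last step, assuming the badblocks output was saved to bad_sectors.txt with one sector number per line, and that sdb still holds good copies of those sectors:
# rewrite each unreadable sector on sda with the matching sector read from sdb
while read -r sector; do
    dd if=/dev/sdb of=/dev/sda bs=512 skip=$sector seek=$sector oflag=direct count=1
done < bad_sectors.txt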
Interesting material for further reading:
- Bad block HOWTO for smartmontools
- Forcing a hard disk to reallocate bad sectors
- Linux — Repairing bad blocks on a RAID1 array with GPT
Introduction
The following material is intended to serve as an example and a reference guide to help spot when disk input/output errors coming from the hardware are creating problems for the backup agent, with somewhat varying error messages shown in the backup console and in PCS logs. The existence of hardware disk errors is not only a problem for the creation of backups — it can pose a hidden yet significant danger to the stability and operability of the customer's machine, and can easily lead to data loss — so spotting them in time can be crucial.
Symptoms
Error messages in the Backup Console
Backup fails with "Common I/O error."
Backup fails with "Cannot read the snapshot of the volume."
Error messages in the mms and/or pcs logs:
Example:
————————
Error code: 21561347
Fields: {"$module":"disk_bundle_lxa64_26077"}
Message: Backup has failed.
————————
Error code: 66596
Fields: {"$module":"disk_bundle_lxa64_26077"}
Message: Failed to commit operations.
————————
Error code: 458755
Fields: {"$module":"disk_bundle_lxa64_26077"}
Message: Read error.
————————
Error code: 5832708
Fields: {"$module":"disk_bundle_lxa64_26077","device":"/dev/mapper/pve-root"}
Message: Cannot read the snapshot of the volume.
————————
Error messages in the Linux kernel logs (/var/log/messages files, output of the dmesg command):
Some examples of I/O-related errors are listed below. This list is not exhaustive.
Keywords to look for
While disk, input-output and storage subsystem errors vary a lot depending on multiple factors (such as the version of the Linux kernel, the exact type of storage controller and storage attachment — some of those would look slightly different if e.g. virtual disks are used inside a hypervisor, or if a disk/volume is attached via iSCSI or Fibre Channel), there are several strings/messages and patterns to look for. This is not an exclusive list:
- ata x.yz … DRDY
- ata x.yz failed command
- WRITE FPDMA QUEUED
- READ FPDMA QUEUED
- print_req_error
- I/O error … <device is normally named, e.g. sda or sdb or sdc disk ID…> < sector(s) NNNN which cannot be read/written is usually mentioned>
- hostbyte
- driverbyte
- DRIVER_SENSE
- Sense Key : Medium Error
- Add. Sense: Unrecovered read error - auto reallocate failed
- ata <ID>: EH complete
Impact on backup and/or restore activities
Impact on backup and recovery activities varies, depending on what operation fails, how it fails, whether it fails every time or only occasionally: e.g. a bad area or sector on disk may not always be permanently bad — sometimes the hardware can recover/repair the lighter errors on its own, in the background; sometimes these errors only occur during unfavorable physical conditions such as excessive vibration in the server/computer/datacenter. It does matter what is stored in the problematic sectors or areas of disk — some parts containing critical LVM or file system metadata, or the OS bootloader and kernel, or the system swap partition/file/area, are usually more important than others.
If the issues are smaller, and do not affect critical areas, the backup agent’s engine may be able to automatically switch into sector-by-sector mode: this can be controlled via the Options sub-menu in the Backup Plan.
However, in practice, in most cases, the I/O errors are serious enough to make even sector-by-sector backups fail (always or intermittently).
Backup creation activities are affected during either the snapshot creation stage by snapapi26 kernel module, or during the actual reading of data from the snapshot in order to send it to the backup.
Restoring backups to problematic disks usually fails when data in the exact bad spots needs to be overwritten, but if critical metadata of the LVM or the FS is corrupted/non-readable/non-writable, a wide variety of errors and messages may appear.
What to do (reactively AND proactively)
- Fix or replace the faulty hardware.
- Repair/resync hardware or software array (if using one).
- Periodically run fsck in a mode that checks the entire disk surface (all blocks). Consult the Linux manpages (man fsck) on how to do this. Use the "badblocks" Linux utility (see the sketch after this list).
- If using hardware or software RAID solutions, configure them to periodically scrub or do patrol reads to detect bad/unstable sectors and disks as early as possible.
- Use advanced file systems like ZFS and BTRFS which have native features to detect and (if configured properly) self-heal some of such errors.
- Take Entire-machine backups frequently enough.
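As a minimal sketch of the surface-check idea from the list above, for an ext filesystem on an unmounted partition (device names are examples; -c makes e2fsck call badblocks in read-only mode and record any bad blocks it finds):
# read-only surface scan via e2fsck/badblocks (use -cc for a non-destructive read-write test)
e2fsck -f -c /dev/sdb1
# or scan the raw device directly and just list unreadable blocks
badblocks -sv /dev/sdb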
Key takeaways
- It is often important to check the Linux kernel logs (/var/log/messages files, dmesg output) for hardware errors when creating backups fails with snapshot errors, unspecified I/O errors, "cannot read…" errors, and similar. The kernel logs are, in such cases, much more precise than the fairly high-level (and often generic) error messages which the backup agent can and does report — it is a user-space application after all, and it cannot always "see" or interpret low-level errors of the I/O subsystem.
- If the hardware cannot read the data (reliably), then the Linux kernel/OS will not be able to read it either, the backup agent will not be able to process it correctly, and thus the backup will keep failing until the hardware problems get fixed (most often, until the bad disk(s), cable(s), or HBA/RAID/adapter card get replaced).
- The backup engine is NOT designed to be able to back up barely functioning/marginal storage hardware; it is NOT a specialized data recovery/disk repair software tool. Specialized data recovery tools can sometimes extract data from very unstable sectors, using specialized techniques like retrying a non-responding sector tens or hundreds of times.
Boot failing due to disk read error
When I am booting into 14.04 I am getting an "Add. Sense: Unrecovered read error - auto reallocate failed". Is there a way to get around this problem to boot into the laptop, or do I need to first back up, reformat and reinstall? Thank you.
Re: Boot failing due to disk read error
It sounds like your hard drive may be failing. I would check that out first, and yes, back up all that you can.
The reallocate error often means it's trying to reallocate bad sectors on the HD.
Dave
Registered Linux User #462608
Morse Code, an early Digital Mode.
Re: Boot failing due to disk read error
Try booting from a Live medium, open a terminal, and run the fsck command on the affected drive/partition, e.g.
Code:
sudo fsck /dev/sda1
Change the path to suit your setup.
Don't try this on a mounted file system; use a Live DVD/USB.
Re: Boot failing due to disk read error
If I have email in Evolution, would I be able to get it off when I log in through a live CD? In order to export the email to a transferable form, I need to log in to the hard drive, right? This I cannot do…
Would appreciate any suggestions. Thanks.
Re: Boot failing due to disk read error
neela; Sure,
Depending on the hard drive's health (see howefield's post #3), it should be possible to mount that hard drive from the liveDVD and copy off your files.
To give explicit direction, we will need to know what we are working with.
Post back, from that liveDVD (USB), the output (between code tags) of these terminal commands:
Code:
sudo fdisk -lu
sudo parted -l
code tag tutorial:
http://ubuntuforums.org/showthread.p…8#post12776168
such that we have the target identified. Then we mount the installed file system and access it to copy off the data to some external (another USB thumb drive?) media; a sketch of this follows below.
ain't nothing but a thing
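A minimal sketch of what that could look like once the partitions are identified (here /dev/sda1 for the installed root and /dev/sdc1 for the thumb drive are assumptions; Evolution's data paths also vary by version, with older releases keeping everything under ~/.evolution):
Code:
sudo mkdir -p /media/oldroot /media/usb
sudo mount /dev/sda1 /media/oldroot     # the installed system's root partition
sudo mount /dev/sdc1 /media/usb         # the external thumb drive
# copy the Evolution mail store and settings from the old home directory
rsync -a /media/oldroot/home/YOURUSER/.local/share/evolution /media/usb/
rsync -a /media/oldroot/home/YOURUSER/.config/evolution /media/usb/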
Re: Boot failing due to disk read error
I was able to move stuff off of the hard disk. Thanks for the suggestions. I have had trouble with some of the email data. I think this is a separate issue, which I will post as a different thread.