[rhelv6-list] fsck -n always showing errors

francis picabia fpicabia at gmail.com
Fri Dec 22 19:43:14 UTC 2017


On Fri, Dec 22, 2017 at 12:29 PM, Brown, Hugh M <hugh-brown at uiowa.edu>
wrote:

> Response at bottom
>
> -----Original Message-----
> From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces@
> redhat.com] On Behalf Of francis picabia
> Sent: Thursday, December 21, 2017 9:47 AM
> To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <
> rhelv6-list at redhat.com>
> Subject: Re: [rhelv6-list] fsck -n always showing errors
>
> Thanks for the replies...
>
>
> OK, I was expecting there must be some sort of false positive going on.
>
> For the system I listed here, those are not persistent errors.
>
>
> However there is one which does show the same orphaned inode numbers
>
> on each run, so this is likely real.
>
> # fsck -n /var
> fsck from util-linux-ng 2.17.2
> e2fsck 1.41.12 (17-May-2010)
> Warning!  /dev/sda2 is mounted.
> Warning: skipping journal recovery because doing a read-only filesystem
> check.
> /dev/sda2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 1059654 has zero dtime.  Fix? no
>
> Inodes that were part of a corrupted orphan linked list found.  Fix? no
>
> Inode 1061014 was part of the orphaned inode list.  IGNORED.
> Inode 1061275 was part of the orphaned inode list.  IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: -124293 -130887 -4244999 -4285460 -4979711
> -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657
> -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336
> Fix? no
>
> Free blocks count wrong (6918236, counted=5214069).
> Fix? no
>
> Inode bitmap differences:  -1059654 -1061014 -1061275 Fix? no
>
> Free inodes count wrong (1966010, counted=1878618).
> Fix? no
>
>
> /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000
> blocks
>
>
> dmesg shows it had some scsi issues.  I suspect the scsi error
>
> is triggered by operation of VDP backup, which freezes the system
>
> for a second when completing the backup snapshot.
>
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
> INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
>  ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
>  ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
>  ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 Call
> Trace:
>  [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750  [<ffffffff81014b39>] ?
> read_tsc+0x9/0x20  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
> [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>]
> io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50
> [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
> [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff810a67b7>] ?
> bit_waitqueue+0x17/0xd0  [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
> [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
> [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
> [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffffa01859b0>] ?
> kjournald2+0x0/0x220 [jbd2]  [<ffffffff810a649e>] kthread+0x9e/0xc0
> [<ffffffff8100c28a>] child_rip+0xa/0x20  [<ffffffff810a6400>] ?
> kthread+0x0/0xc0  [<ffffffff8100c280>] ? child_rip+0x0/0x20
> INFO: task master:1778 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> master        D 0000000000000000     0  1778      1 0x00000080
>  ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
>  00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
>  ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 Call
> Trace:
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>]
> io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50
> [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ?
> __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
> [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
> [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
> [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
> [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
> [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
> [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
> [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7102>]
> file_update_time+0xf2/0x170  [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
> [<ffffffff81199c2a>] do_sync_write+0xfa/0x140  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffff8119f964>] ?
> cp_new_stat+0xe4/0x100  [<ffffffff81014b39>] ? read_tsc+0x9/0x20
> [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
>  [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
>  [<ffffffff81199f28>] vfs_write+0xb8/0x1a0  [<ffffffff8119b416>] ?
> fget_light_pos+0x16/0x50  [<ffffffff8119aa61>] sys_write+0x51/0xb0
> [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> INFO: task pickup:1236 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> pickup        D 0000000000000001     0  1236   1778 0x00000080
>  ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
>  ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
>  ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 Call
> Trace:
>  [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
> [<ffffffff811d1440>] sync_buffer+0x40/0x50  [<ffffffff8154b99a>]
> __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
> [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ?
> __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
> [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
> [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
> [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
> [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
> [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
> [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
> [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7315>]
> touch_atime+0x195/0x1a0  [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
> [<ffffffff81199d6a>] do_sync_read+0xfa/0x140  [<ffffffff811e2e80>] ?
> ep_send_events_proc+0x0/0x110  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffff8123ae06>] ?
> security_file_permission+0x16/0x20
>  [<ffffffff8119a665>] vfs_read+0xb5/0x1a0  [<ffffffff8119b416>] ?
> fget_light_pos+0x16/0x50  [<ffffffff8119a9b1>] sys_read+0x51/0xb0
> [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
> sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
> sd 2:0:0:0: [sda] SCSI device reset on scsi2:0
>
>
> If I just repair systems with that in their runtime history I should be on
> target for any concerns.
>
>
> Thanks for the responses...
>
>
>
>
> I've never really had fsck fail to correct errors when run manually. I
> have had the touch /forcefsck && reboot option decide that a fix was too
> risky and refuse to do it. The manual run would then fix it. Typically
> booting single user mode was enough to sort it out. If the problem disk was
> the root fs, then rescue media was the solution.
>
> We did have an iscsi array reboot which caused the filesystem to go
> read-only and at the time, we ran fsck -n to check for any errors. We did
> get a few errors of the type that you'd expect from a filesystem that is
> mounted, but not any inode or bitmap errors.
>
> We also had a hyper-v vm get in a wedged state because the backup
> mechanism called the filesystem freeze (fsfreeze) and then the backup
> software crashed and never unfroze the filesystem. We had to update the
> backup software and the hyper-v drivers for that.
>
> The only time I couldn't get fsck to behave was when a couple of systems
> had faulty RAM. In those cases the filesystem corruption was severe and it
> was easier to replace memory and reimage/restore from backups.
>
> So, I don't think fsck is showing false positives. You should be able to
> clear the errors with a manual fsck and I would definitely be concerned
> that a number of systems were showing fs errors.
>
> If you can't get the manual fsck to fix all of the errors, it might be
> worth opening a support ticket with RedHat.
>
> Hugh
>
>
>
This topic has a degree of "Your Mileage May Vary".  Yes, some file system
problems caused by real physical disk errors will be difficult, or sometimes
even impossible, to recover from.  It depends on how serious the flaws are.
If it is only a power loss situation, then the transaction rollback should
do the trick.  If it is media, a controller, or other hardware causing
interrupts, it is anything from 4 to 9 on the Richter scale: maybe just some
loss in the log files, or database files might be corrupted.

I say fsck is showing false positives because it is evaluating the file
system while the file system is changing.  Minor items like wrong free block
counts or a missing dtime are typical of looking at the file system while
there are writes.

If you doubt it, try fsck -n /var on a system running an active website or
database, and see for yourself what it reports.  I can tell you that on
almost every Red Hat system I checked, /var was not clean.  I'm talking
about production servers and the like, not one's relatively quiet desktop
system.
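
To separate findings that persist from ones that only reflect writes in
flight, one could compare the inode numbers reported by several fsck -n runs.
Below is a minimal Python sketch of that idea.  The function names are made up
for illustration, the regular expression is based only on the message formats
quoted above, and none of this is part of e2fsprogs.

```python
from __future__ import annotations

import re

# Matches the inode-bearing messages seen in the quoted e2fsck output, e.g.
# "Deleted inode 1059654 has zero dtime" and
# "Inode 1061014 was part of the orphaned inode list".
# Assumption: other e2fsck messages may need additional patterns.
INODE_RE = re.compile(r"\b[Ii]node (\d+)\b")


def inodes_reported(fsck_output: str) -> set[int]:
    """Return every inode number mentioned in the text of one fsck -n run."""
    return {int(n) for n in INODE_RE.findall(fsck_output)}


def persistent_inodes(runs: list[str]) -> set[int]:
    """Return the inode numbers that show up in every run (empty if no runs)."""
    if not runs:
        return set()
    common = inodes_reported(runs[0])
    for output in runs[1:]:
        common &= inodes_reported(output)
    return common
```

Feed it the saved output of two or more runs taken some minutes apart.  Inode
numbers that come back every time, like the orphaned inodes on /dev/sda2
above, are the ones more likely to need a real repair on an unmounted file
system; numbers that appear only once are more likely noise from a busy mount.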