
[rhelv6-list] Unsubscribe





On Fri, Dec 22, 2017, 11:29 Brown, Hugh M <hugh-brown uiowa edu> wrote:
Response at bottom

-----Original Message-----
From: rhelv6-list-bounces redhat com [mailto:rhelv6-list-bounces redhat com] On Behalf Of francis picabia
Sent: Thursday, December 21, 2017 9:47 AM
To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <rhelv6-list redhat com>
Subject: Re: [rhelv6-list] fsck -n always showing errors

Thanks for the replies...


OK, I was expecting there must be some sort of false positive going on.

For the system I listed here, those are not persistent errors.

However, there is one system that shows the same orphaned inode numbers on each run, so this is likely real.

# fsck -n /var
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Warning!  /dev/sda2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 1059654 has zero dtime.  Fix? no

Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 1061014 was part of the orphaned inode list.  IGNORED.
Inode 1061275 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no

Free blocks count wrong (6918236, counted=5214069).
Fix? no

Inode bitmap differences:  -1059654 -1061014 -1061275 Fix? no

Free inodes count wrong (1966010, counted=1878618).
Fix? no


/dev/sda2: ********** WARNING: Filesystem still has errors **********

/dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 blocks
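The test applied above (an inode flagged on every run is probably real, while one that comes and goes is probably noise from checking a mounted filesystem) can be automated. The following is a minimal Python sketch, assuming the text of several `fsck -n` runs has been saved to strings; the function names are hypothetical, and it only recognises the three inode-related message types that appear in the output above.

```python
import re

# Message shapes copied from the e2fsck output above; any other message
# type is ignored by this sketch.
SINGLE_INODE = [
    re.compile(r"Deleted inode (\d+) has zero dtime"),
    re.compile(r"Inode (\d+) was part of the orphaned inode list"),
]
# Bitmap entries look like "-1059654" or, for a range, "-(7095835--7095839)".
BITMAP_ITEM = re.compile(r"[-+]\((\d+)--(\d+)\)|[-+](\d+)")


def _expand_bitmap(items: str) -> set[int]:
    """Expand the entries of a 'bitmap differences' list into numbers."""
    numbers = set()
    for start, end, single in BITMAP_ITEM.findall(items):
        if single:
            numbers.add(int(single))
        else:
            numbers.update(range(int(start), int(end) + 1))
    return numbers


def flagged_inodes(report: str) -> set[int]:
    """Return every inode number one `fsck -n` report complains about."""
    found = set()
    for pattern in SINGLE_INODE:
        found.update(int(n) for n in pattern.findall(report))
    for line in report.splitlines():
        _, marker, rest = line.partition("Inode bitmap differences:")
        if marker:
            found |= _expand_bitmap(rest)
    return found


def persistent_inodes(reports: list[str]) -> set[int]:
    """Inodes flagged in every report; transient ones drop out."""
    if not reports:
        return set()
    return set.intersection(*(flagged_inodes(r) for r in reports))
```

In practice one would compare reports from several runs taken some time apart; a non-empty result from `persistent_inodes` suggests the filesystem deserves an offline check.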


dmesg shows it had some SCSI issues. I suspect the SCSI errors are triggered by the VDP backup, which freezes the system for a second when it completes the backup snapshot.

sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
 ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
 ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
 ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8
Call Trace:
 [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
 [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810a649e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a6400>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
INFO: task master:1778 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master        D 0000000000000000     0  1778      1 0x00000080
 ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
 ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8
Call Trace:
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7102>] file_update_time+0xf2/0x170
 [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
 [<ffffffff81199c2a>] do_sync_write+0xfa/0x140
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8119f964>] ? cp_new_stat+0xe4/0x100
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff81199f28>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119aa61>] sys_write+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task pickup:1236 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pickup        D 0000000000000001     0  1236   1778 0x00000080
 ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
 ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
 ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8
Call Trace:
 [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7315>] touch_atime+0x195/0x1a0
 [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
 [<ffffffff81199d6a>] do_sync_read+0xfa/0x140
 [<ffffffff811e2e80>] ? ep_send_events_proc+0x0/0x110
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff8119a665>] vfs_read+0xb5/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a9b1>] sys_read+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
sd 2:0:0:0: [sda] SCSI device reset on scsi2:0


If I just repair the systems that have errors like these in their runtime history, I should have any real concerns covered.
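Finding the systems with that kind of runtime history can be scripted. Below is a minimal Python sketch, assuming the kernel log text has already been collected (for example from `dmesg` or the syslog files). The marker patterns are copied from the log excerpt above; the function names, and the choice of which messages count as suspect, are assumptions of this sketch.

```python
import re

# Markers copied from the dmesg excerpt above. Treating every one of them
# as a reason for an offline fsck is an assumption of this sketch.
MARKERS = {
    "scsi_abort": re.compile(r"\[(sd[a-z]+)\] task abort on host \d+"),
    "scsi_reset": re.compile(r"\[(sd[a-z]+)\] SCSI device reset"),
    "hung_task": re.compile(r"INFO: task (\S+) blocked for more than \d+ seconds"),
}


def scan_kernel_log(text: str) -> dict[str, list[str]]:
    """Map each marker name to the disks or tasks it matched, in log order."""
    return {name: pattern.findall(text) for name, pattern in MARKERS.items()}


def has_storage_trouble(text: str) -> bool:
    """True if the log shows SCSI aborts/resets or hung tasks."""
    return any(scan_kernel_log(text).values())
```

Note that the kernel ring buffer read by `dmesg` is overwritten over time, so older events may only survive in the rotated log files.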


Thanks for the responses...




I've never really had fsck fail to correct errors when run manually. I have had the touch /forcefsck && reboot approach decide that a fix was too risky and refuse to make it; a manual run would then fix it. Typically, booting into single-user mode was enough to sort it out. If the problem disk held the root filesystem, rescue media was the solution.

We did have an iSCSI array reboot, which caused the filesystem to go read-only, and at the time we ran fsck -n to check for errors. We did get a few errors of the type you'd expect from a mounted filesystem, but no inode or bitmap errors.

We also had a Hyper-V VM get into a wedged state because the backup mechanism froze the filesystem (fsfreeze), then the backup software crashed and never unfroze it. We had to update the backup software and the Hyper-V drivers to fix that.
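The failure described here, a freeze with no matching thaw, is what a `try`/`finally` around the backup step guards against. Below is a minimal Python sketch, assuming the snapshot is driven by a script that calls the util-linux `fsfreeze` command (`-f` to freeze, `-u` to unfreeze). The wrapper name and the injectable `runner` parameter are hypothetical; the latter exists only so the logic can be tested without root.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list; the default shells out to the real command.
Runner = Callable[[Sequence[str]], object]


def _run(argv: Sequence[str]) -> object:
    # check=True turns a non-zero exit from fsfreeze into an exception.
    return subprocess.run(list(argv), check=True)


def backup_with_freeze(mountpoint: str, backup: Callable[[], None],
                       runner: Runner = _run) -> None:
    """Freeze `mountpoint`, run `backup`, and thaw it whatever happens."""
    runner(["fsfreeze", "-f", mountpoint])
    try:
        backup()
    finally:
        # Runs even when backup() raises, so the filesystem is not left frozen.
        runner(["fsfreeze", "-u", mountpoint])
```

This only covers a backup step that fails inside the wrapper. If the wrapper process itself is killed, the filesystem can still be left frozen, and `fsfreeze -u <mountpoint>` has to be run by hand.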

The only time I couldn't get fsck to behave was when a couple of systems had faulty RAM. In those cases the filesystem corruption was severe and it was easier to replace memory and reimage/restore from backups.

So, I don't think fsck is showing false positives. You should be able to clear the errors with a manual fsck and I would definitely be concerned that a number of systems were showing fs errors.
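When running the manual fsck, the exit status shows whether everything was fixed. The e2fsck(8) man page documents it as a sum of bit flags: 1 errors corrected, 2 errors corrected and reboot needed, 4 errors left uncorrected, 8 operational error, 16 usage or syntax error, 32 cancelled by user request, 128 shared library error. A small Python sketch that decodes it (the function names are hypothetical):

```python
# Exit-status bits documented in the e2fsck(8) man page; the status returned
# is the sum (bitwise OR) of every condition that applied.
E2FSCK_BITS = {
    1: "errors corrected",
    2: "errors corrected, reboot needed",
    4: "errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "cancelled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status: int) -> list[str]:
    """Translate an e2fsck exit status into readable conditions."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in E2FSCK_BITS.items() if status & bit]


def fully_repaired(status: int) -> bool:
    """True when only the 'corrected' bits (1 and 2) are set, or none."""
    return (status & ~0b11) == 0
```

A read-only `fsck -n` that finds problems, like the run quoted earlier, would be expected to report bit 4, since it corrects nothing; after a successful offline repair the status should contain at most bits 1 and 2.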

If you can't get the manual fsck to fix all of the errors, it might be worth opening a support ticket with Red Hat.

Hugh



_______________________________________________
rhelv6-list mailing list
rhelv6-list redhat com
https://www.redhat.com/mailman/listinfo/rhelv6-list
