
Re: [rhelv6-list] fsck -n always showing errors



Thanks for the replies...

OK, I was expecting there must be some sort of false positive going on.
For the system I listed here, those are not persistent errors.

However, there is one system which does show the same orphaned inode numbers
on each run, so that one is likely real.

# fsck -n /var
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Warning!  /dev/sda2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 1059654 has zero dtime.  Fix? no

Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 1061014 was part of the orphaned inode list.  IGNORED.
Inode 1061275 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336
Fix? no

Free blocks count wrong (6918236, counted=5214069).
Fix? no

Inode bitmap differences:  -1059654 -1061014 -1061275
Fix? no

Free inodes count wrong (1966010, counted=1878618).
Fix? no


/dev/sda2: ********** WARNING: Filesystem still has errors **********

/dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 blocks

dmesg shows it had some SCSI issues.  I suspect the SCSI errors
are triggered by the VDP backup, which freezes the system
for a second while completing the backup snapshot.

sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
 ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
 ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
 ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8
Call Trace:
 [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
 [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810a649e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a6400>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
INFO: task master:1778 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master        D 0000000000000000     0  1778      1 0x00000080
 ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
 ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8
Call Trace:
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7102>] file_update_time+0xf2/0x170
 [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
 [<ffffffff81199c2a>] do_sync_write+0xfa/0x140
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8119f964>] ? cp_new_stat+0xe4/0x100
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff81199f28>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119aa61>] sys_write+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task pickup:1236 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pickup        D 0000000000000001     0  1236   1778 0x00000080
 ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
 ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
 ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8
Call Trace:
 [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7315>] touch_atime+0x195/0x1a0
 [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
 [<ffffffff81199d6a>] do_sync_read+0xfa/0x140
 [<ffffffff811e2e80>] ? ep_send_events_proc+0x0/0x110
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff8119a665>] vfs_read+0xb5/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a9b1>] sys_read+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
sd 2:0:0:0: [sda] SCSI device reset on scsi2:0

If I just repair the systems with SCSI errors like that in their runtime
history, I should cover any real concerns.
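To find which systems have that history, a log scan along these lines could help. This is only a sketch: the patterns and the /var/log/messages path are assumptions and may need adjusting for your syslog layout and kernel version.

```shell
# Sketch: flag hosts whose logs show the SCSI symptoms above.
# Patterns and the log path are assumptions; adjust for your setup.

scan_scsi() {
    # Reads log text on stdin, prints lines matching the error signatures.
    grep -E 'task abort on host|SCSI device reset|blocked for more than [0-9]+ seconds'
}

# Run against the real log only if it is readable on this machine.
[ -r /var/log/messages ] && scan_scsi < /var/log/messages
```

Run it on each suspect host (or point it at centralized syslog) and repair only the filesystems on hosts that produce hits.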

Thanks for the responses...


On Thu, Dec 21, 2017 at 9:09 AM, Cale Fairchild <cfairchild brocku ca> wrote:

Have you checked the filesystem from a rescue disk, or does the fsck on reboot report that it is fixing errors each time? As far as I understand, running `fsck -n /` on the active root filesystem will almost always return some errors, since blocks in the filesystem are changing while fsck runs its passes. Hence the warning at the beginning of the process about the filesystem being mounted. Sorry if I am misunderstanding your process, but if you have not tried checking the filesystem after booting into rescue mode, that would be a good step.

 

From: rhelv6-list-bounces redhat com [mailto:rhelv6-list-bounces@redhat.com] On Behalf Of francis picabia
Sent: December 21, 2017 07:21
To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <rhelv6-list redhat com>
Subject: Re: [rhelv6-list] fsck -n always showing errors

 

fsck -n is used to verify only.

The touch on /forcefsck forces a regular fsck on unmounted
partitions at boot.

So what I've done is:

fsck -n
touch /forcefsck
reboot

...three times. It should actually be fixing the problems on reboot.
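One way to confirm the boot-time fsck actually ran (rather than being skipped) is to look at the superblock fields e2fsck updates. A sketch; the device name is just an example for the layout shown earlier:

```shell
# Sketch: check whether the boot-time fsck really ran, by reading the
# superblock fields e2fsck updates. /dev/sda2 is an example device.

fs_check_info() {
    # Filters tune2fs -l output down to the fields of interest.
    grep -E '^(Filesystem state|Last checked|Mount count|Maximum mount count):'
}

dev=/dev/sda2
# Query the real superblock only if the device exists on this machine.
[ -b "$dev" ] && tune2fs -l "$dev" | fs_check_info
```

If "Last checked" predates the reboot, the forced fsck never actually ran against that partition.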

I can find at least some fsck errors on every Red Hat 6 machine,
whether virtual or physical.  I've tested the fsck -n status on about
twelve systems, all of which show some errors.  Only two showed a history
of SCSI errors, both happening to be VMware guests.

Maybe some other people can test this on their Red Hat 6 systems
and see if fsck -n /var or similar comes back clean.  You might
be surprised to see the same state I've noticed.  There is
no issue like a read-only filesystem.  Everything is functional.

 

 

On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi <gianluca cecchi gmail com> wrote:

 

On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia gmail com> wrote:

 

With one development box I did touch /forcefsck and rebooted.
Re-ran fsck and still saw issues.  I repeated this cycle three times
with no improvement.

 

Hi,

not going into the reasons for the problem, but into your "cycle".

If I have understood your message correctly, you run fsck with the "-n" option, which automatically answers "no" to all the questions about problems and suggested fixes.

So, since you didn't fix anything, the next run of fsck exposes the same problems again....

 

Sometimes in vSphere environments I have seen storage problems cause trouble for Linux VMs, with the kernel automatically putting one or more filesystems into read-only mode: typically the filesystems that had writes in flight when the problem occurred.

So in your case it could be something similar, affecting all the VMs that sit on the problematic storage / datastore.

If you have no monitoring in place, such as Nagios with a read-only-filesystem check, you can go on for days before realizing that you had a problem.

Analyzing /var/log/messages you should see when it happened.
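For example, a grep along these lines should show the timestamps of the event. The message patterns are typical ext4/kernel wording and are an assumption; they can vary by kernel version:

```shell
# Sketch: find when the storage problem hit, from syslog. The patterns
# are typical ext4/kernel messages and may differ by kernel version.

when_it_happened() {
    grep -E 'Remounting filesystem read-only|EXT4-fs error|I/O error|task abort'
}

# Run against the real log only if it is readable on this machine.
[ -r /var/log/messages ] && when_it_happened < /var/log/messages
```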

 

Keep in mind that if the filesystem went into read-only mode due to a SCSI error (an action the kernel takes to prevent further errors and data corruption), you will not be able to remount it read-write; you will have to reboot the server.
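A quick way to spot a filesystem that has gone read-only is to parse the mount table. A sketch, using the /proc/mounts field layout (device, mountpoint, fstype, options, dump, pass):

```shell
# Sketch: list filesystems currently mounted read-only, from /proc/mounts.
# Field layout: device mountpoint fstype options dump pass

ro_mounts() {
    # Prints mountpoints whose option list contains "ro" as a whole option.
    awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }'
}

[ -r /proc/mounts ] && ro_mounts < /proc/mounts
```

This only reports the current state; it will not tell you about a read-only remount that was already cleared by a reboot.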

 

Just a guess.

HTH,

Gianluca

 


_______________________________________________
rhelv6-list mailing list
rhelv6-list redhat com
https://www.redhat.com/mailman/listinfo/rhelv6-list

 



