[rhelv6-list] fsck -n always showing errors

solarflow99 solarflow99 at gmail.com
Thu Dec 21 17:06:56 UTC 2017


I'd just do a rescue, it doesn't even need to be EL6, and do the fsck in
read-write mode.
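
Something like this, for example (a sketch; /dev/sda2 is the device from the
logs below, and it must not be mounted in the rescue environment):

    # boot any rescue media, skip mounting the target system, then:
    e2fsck -f /dev/sda2      # forced check, prompts before each fix
    # or, non-interactively:
    e2fsck -f -y /dev/sda2   # answer yes to every fix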


On Dec 21, 2017 7:47 AM, "francis picabia" <fpicabia at gmail.com> wrote:

> Thanks for the replies...
>
> OK, I expected there was some sort of false positive going on.
> For the system I listed here, those are not persistent errors.
>
> However, there is one system that does show the same orphaned inode
> numbers on each run, so that is likely real.
>
> # fsck -n /var
> fsck from util-linux-ng 2.17.2
> e2fsck 1.41.12 (17-May-2010)
> Warning!  /dev/sda2 is mounted.
> Warning: skipping journal recovery because doing a read-only filesystem
> check.
> /dev/sda2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 1059654 has zero dtime.  Fix? no
>
> Inodes that were part of a corrupted orphan linked list found.  Fix? no
>
> Inode 1061014 was part of the orphaned inode list.  IGNORED.
> Inode 1061275 was part of the orphaned inode list.  IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences:  -124293 -130887 -4244999 -4285460 -4979711
> -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657
> -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336
> Fix? no
>
> Free blocks count wrong (6918236, counted=5214069).
> Fix? no
>
> Inode bitmap differences:  -1059654 -1061014 -1061275
> Fix? no
>
> Free inodes count wrong (1966010, counted=1878618).
> Fix? no
>
>
> /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000
> blocks
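>
> As an aside, fsck's exit status is a bitmask (see fsck(8)), so a script can
> tell a clean check from one that found problems; with -n, uncorrected errors
> set bit 4.  A rough sketch:
>
>     fsck -n /var; rc=$?
>     if [ $((rc & 4)) -ne 0 ]; then
>         echo "errors left uncorrected on /var"
>     fi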
>
> dmesg shows it had some SCSI issues.  I suspect the SCSI errors are
> triggered by the VDP backup, which freezes the system for a second when
> completing the backup snapshot.
>
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
> INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
>  ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
>  ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
>  ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8
> Call Trace:
>  [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750
>  [<ffffffff81014b39>] ? read_tsc+0x9/0x20
>  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
>  [<ffffffff811d1440>] sync_buffer+0x40/0x50
>  [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
>  [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
>  [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
>  [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
>  [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]
>  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]
>  [<ffffffff810a649e>] kthread+0x9e/0xc0
>  [<ffffffff8100c28a>] child_rip+0xa/0x20
>  [<ffffffff810a6400>] ? kthread+0x0/0xc0
>  [<ffffffff8100c280>] ? child_rip+0x0/0x20
> INFO: task master:1778 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> master        D 0000000000000000     0  1778      1 0x00000080
>  ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
>  00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
>  ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8
> Call Trace:
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
>  [<ffffffff811d1440>] sync_buffer+0x40/0x50
>  [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
>  [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
>  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
>  [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
>  [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
>  [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
>  [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
>  [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
>  [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
>  [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
>  [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
>  [<ffffffff811b7102>] file_update_time+0xf2/0x170
>  [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
>  [<ffffffff81199c2a>] do_sync_write+0xfa/0x140
>  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff8119f964>] ? cp_new_stat+0xe4/0x100
>  [<ffffffff81014b39>] ? read_tsc+0x9/0x20
>  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
>  [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
>  [<ffffffff81199f28>] vfs_write+0xb8/0x1a0
>  [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
>  [<ffffffff8119aa61>] sys_write+0x51/0xb0
>  [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
>  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> INFO: task pickup:1236 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> pickup        D 0000000000000001     0  1236   1778 0x00000080
>  ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
>  ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
>  ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8
> Call Trace:
>  [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
>  [<ffffffff811d1440>] sync_buffer+0x40/0x50
>  [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
>  [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
>  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
>  [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
>  [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
>  [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
>  [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
>  [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
>  [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
>  [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
>  [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
>  [<ffffffff811b7315>] touch_atime+0x195/0x1a0
>  [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
>  [<ffffffff81199d6a>] do_sync_read+0xfa/0x140
>  [<ffffffff811e2e80>] ? ep_send_events_proc+0x0/0x110
>  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
>  [<ffffffff8119a665>] vfs_read+0xb5/0x1a0
>  [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
>  [<ffffffff8119a9b1>] sys_read+0x51/0xb0
>  [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
>  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
> sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
> sd 2:0:0:0: [sda] SCSI device reset on scsi2:0
>
> If I just repair the systems with those errors in their runtime history,
> I should be on target for any real concerns.
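>
> To find the systems with that history, grepping the kernel logs for the
> messages above should do it (a sketch):
>
>     dmesg | grep -E 'task abort|SCSI device reset'
>     grep -l 'task abort on host' /var/log/messages* 2>/dev/null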
>
> Thanks for the responses...
>
>
> On Thu, Dec 21, 2017 at 9:09 AM, Cale Fairchild <cfairchild at brocku.ca>
> wrote:
>
>> Have you checked the filesystem from a rescue disk, or does the fsck on
>> reboot report that it is fixing errors each time? As far as I understand,
>> running `fsck -n /` on the active root filesystem will almost always return
>> some errors, because blocks in the filesystem are changing while fsck is
>> running its passes; hence the warning at the beginning of the process about
>> the filesystem being mounted. Sorry if I am misunderstanding your process,
>> but if you have not tried checking the filesystem after booting into rescue
>> mode, that would be a good step.
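>>
>> For example, with the EL6 install media (a sketch of the usual procedure;
>> adjust the device name to your system):
>>
>>     # at the install media boot prompt:
>>     linux rescue
>>     # choose "Skip" when offered to mount the installed system,
>>     # then run the check read-write:
>>     e2fsck -f /dev/sda2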
>>
>>
>>
>> *From:* rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at redhat.com] *On Behalf Of *francis picabia
>> *Sent:* December 21, 2017 07:21
>> *To:* Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list
>> <rhelv6-list at redhat.com>
>> *Subject:* Re: [rhelv6-list] fsck -n always showing errors
>>
>>
>>
>> fsck -n is used to verify only.  Touching /forcefsck forces a regular
>> fsck of the unmounted partitions at boot.
>>
>> So what I've done, three times over, is:
>>
>> fsck -n
>> touch /forcefsck
>> reboot
>>
>> The boot-time fsck should actually be fixing the problems.
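>>
>> One way to confirm that the boot-time fsck really ran and left the
>> filesystem clean is to look at the superblock afterwards (a sketch;
>> tune2fs is part of e2fsprogs):
>>
>>     tune2fs -l /dev/sda2 | grep -iE 'state|last checked'
>>     # "Filesystem state: clean" plus a fresh "Last checked" timestamp
>>     # suggest the forced check completed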
>>
>> I find at least some fsck errors on every Red Hat 6 machine, whether
>> virtual or physical.  I've tested the fsck -n status on about twelve
>> systems, and all of them show some errors.  Only 2 showed a history of
>> SCSI errors, both of them happening to be VMware guests.
>>
>> Maybe some other people can test this on their Red Hat 6 systems and see
>> if fsck -n /var or similar comes back clean.  You might be surprised to
>> see the same state I've noticed.  There is no issue like a read-only
>> filesystem; everything is functional.
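>>
>> A loop like this makes the spot check quick (a sketch; the host names are
>> placeholders):
>>
>>     for h in rhel6-a rhel6-b rhel6-c; do
>>         echo "== $h =="
>>         ssh root@"$h" "fsck -n /var 2>&1 | tail -n 5"
>>     done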
>>
>>
>>
>>
>>
>> On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi <
>> gianluca.cecchi at gmail.com> wrote:
>>
>>
>>
>> On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia at gmail.com>
>> wrote:
>>
>>
>>
>> With one development box I did touch /forcefsck and rebooted, then
>> retested with fsck and still saw issues.  I repeated this cycle three
>> times with no improvement.
>>
>>
>>
>> Hi,
>>
>> I'm not going into the reasons for the problem, just into your "cycle".
>>
>> If I have understood your sentence correctly, you run fsck with the "-n"
>> option, which automatically answers "no" to all the questions about the
>> problems and the suggestions to fix them.  So, as you didn't fix anything,
>> the next run of fsck exposes the same problems again.
>>
>>
>>
>> Sometimes in vSphere environments I have seen storage problems cause
>> trouble for Linux VMs, with the kernel automatically putting one or more
>> filesystems into read-only mode: typically the filesystems that had
>> writes in flight when the problem occurred.  So in your case it could be
>> something similar, affecting all the VMs sitting on the problematic
>> storage / datastore.
>>
>> If you have no monitoring in place, such as Nagios with a check like
>> this:
>>
>> https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details
>>
>> you can go on for days before realizing that you had a problem.
>> Analyzing /var/log/messages, you should be able to see when it happened.
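>>
>> For example (a sketch; the patterns match the usual kernel messages):
>>
>>     # when did the errors happen?
>>     grep -E 'task abort|SCSI device reset|EXT4-fs error' /var/log/messages*
>>     # which filesystems are mounted read-only right now?
>>     awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $4}' /proc/mounts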
>>
>>
>>
>> Keep in mind that if the filesystem went read-only due to a SCSI error
>> (an action taken by the kernel to prevent further errors and data
>> corruption), you will not be able to remount it read-write; you have to
>> reboot the server.
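>>
>> In other words, after such an event a plain
>>
>>     mount -o remount,rw /var
>>
>> will typically be refused while the journal is marked aborted; the clean
>> path is a reboot with a forced fsck.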
>>
>>
>>
>> Just a guess.
>>
>> HTH,
>>
>> Gianluca

