
Re: [rhelv6-list] fsck -n always showing errors



Thanks for the replies...

OK, I was expecting there must be some sort of false positive going on.
For the system I listed here, those are not persistent errors.

However, there is one system which does show the same orphaned inode numbers
on each run, so that one is likely real.

# fsck -n /var
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Warning!  /dev/sda2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 1059654 has zero dtime.  Fix? no

Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 1061014 was part of the orphaned inode list.  IGNORED.
Inode 1061275 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336
Fix? no

Free blocks count wrong (6918236, counted=5214069).
Fix? no

Inode bitmap differences:  -1059654 -1061014 -1061275
Fix? no

Free inodes count wrong (1966010, counted=1878618).
Fix? no


/dev/sda2: ********** WARNING: Filesystem still has errors **********

/dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 blocks

dmesg shows it had some SCSI issues.  I suspect the SCSI errors
are triggered by the VDP backup, which freezes the system
for a second while completing the backup snapshot.

sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
 ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
 ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
 ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8
Call Trace:
 [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
 [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810a649e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a6400>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
INFO: task master:1778 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master        D 0000000000000000     0  1778      1 0x00000080
 ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
 ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8
Call Trace:
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7102>] file_update_time+0xf2/0x170
 [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
 [<ffffffff81199c2a>] do_sync_write+0xfa/0x140
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8119f964>] ? cp_new_stat+0xe4/0x100
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff81199f28>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119aa61>] sys_write+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task pickup:1236 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pickup        D 0000000000000001     0  1236   1778 0x00000080
 ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
 ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
 ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8
Call Trace:
 [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7315>] touch_atime+0x195/0x1a0
 [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
 [<ffffffff81199d6a>] do_sync_read+0xfa/0x140
 [<ffffffff811e2e80>] ? ep_send_events_proc+0x0/0x110
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff8119a665>] vfs_read+0xb5/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a9b1>] sys_read+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
sd 2:0:0:0: [sda] SCSI device reset on scsi2:0

If I just repair the systems with SCSI errors like that in their runtime
history, I should cover any real concerns.
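To find which systems have that history, a log scan along these lines could help. This is only a sketch: the patterns and the /var/log/messages path are assumptions and may need adjusting for your syslog layout and kernel version.

```shell
# Sketch: flag hosts whose logs show the SCSI symptoms above.
# Patterns and the log path are assumptions; adjust for your setup.

scan_scsi() {
    # Reads log text on stdin, prints lines matching the error signatures.
    grep -E 'task abort on host|SCSI device reset|blocked for more than [0-9]+ seconds'
}

# Run against the real log only if it is readable on this machine.
[ -r /var/log/messages ] && scan_scsi < /var/log/messages
```

Run it on each suspect host (or point it at centralized syslog) and repair only the filesystems on hosts that produce hits.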

Thanks for the responses...


On Thu, Dec 21, 2017 at 9:09 AM, Cale Fairchild <cfairchild brocku ca> wrote:

Have you checked the filesystem from a rescue disk, or does the fsck on reboot report that it is fixing errors each time? As far as I understand, running `fsck -n /` on the active root filesystem will almost always return some errors, since blocks in the filesystem are changing while fsck runs its passes. Hence the warning at the beginning of the process about the filesystem being mounted. Sorry if I am misunderstanding your process, but if you have not tried checking the filesystem after booting into rescue mode, that would be a good step.

 

From: rhelv6-list-bounces redhat com [mailto:rhelv6-list-bounces@redhat.com] On Behalf Of francis picabia
Sent: December 21, 2017 07:21
To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <rhelv6-list redhat com>
Subject: Re: [rhelv6-list] fsck -n always showing errors

 

fsck -n is used to verify only.

The touch on /forcefsck forces a regular fsck on unmounted
partitions at boot.

So what I've done is:

fsck -n
touch /forcefsck
reboot

...three times. It should actually be fixing the problems on reboot.
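One way to confirm the boot-time fsck actually ran (rather than being skipped) is to look at the superblock fields e2fsck updates. A sketch; the device name is just an example for the layout shown earlier:

```shell
# Sketch: check whether the boot-time fsck really ran, by reading the
# superblock fields e2fsck updates. /dev/sda2 is an example device.

fs_check_info() {
    # Filters tune2fs -l output down to the fields of interest.
    grep -E '^(Filesystem state|Last checked|Mount count|Maximum mount count):'
}

dev=/dev/sda2
# Query the real superblock only if the device exists on this machine.
[ -b "$dev" ] && tune2fs -l "$dev" | fs_check_info
```

If "Last checked" predates the reboot, the forced fsck never actually ran against that partition.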

I can find at least some fsck errors on every Red Hat 6 machine,
whether virtual or physical.  I've tested the fsck -n status on about
twelve systems, all of which show some errors.  Only two showed a history
of SCSI errors, both happening to be VMware guests.

Maybe some other people can test this on their Red Hat 6 systems
and see if fsck -n /var or similar comes back clean.  You might
be surprised to see the same state I've noticed.  There is
no issue like a read-only filesystem.  Everything is functional.

 

 

On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi <gianluca cecchi gmail com> wrote:

 

On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia gmail com> wrote:

 

With one development box I did touch /forcefsck and rebooted.
Re-ran fsck and still saw issues.  I repeated this cycle three times
with no improvement.

 

Hi,

not going into the reasons for the problem, but into your "cycle".

If I have understood your message correctly, you run fsck with the "-n" option, which automatically answers "no" to all the questions about problems and suggested fixes.

So, since you didn't fix anything, the next run of fsck exposes the same problems again....

 

Sometimes in vSphere environments I have seen storage problems cause trouble for Linux VMs, with the kernel automatically putting one or more filesystems into read-only mode: typically the filesystems that had writes in flight when the problem occurred.

So in your case it could be something similar, affecting all the VMs that sit on the problematic storage / datastore.

If you have no monitoring in place, such as Nagios with a read-only-filesystem check, you can go on for days before realizing that you had a problem.

Analyzing /var/log/messages you should see when it happened.
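For example, a grep along these lines should show the timestamps of the event. The message patterns are typical ext4/kernel wording and are an assumption; they can vary by kernel version:

```shell
# Sketch: find when the storage problem hit, from syslog. The patterns
# are typical ext4/kernel messages and may differ by kernel version.

when_it_happened() {
    grep -E 'Remounting filesystem read-only|EXT4-fs error|I/O error|task abort'
}

# Run against the real log only if it is readable on this machine.
[ -r /var/log/messages ] && when_it_happened < /var/log/messages
```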

 

Keep in mind that if the filesystem went into read-only mode due to a SCSI error (an action the kernel takes to prevent further errors and data corruption), you will not be able to remount it read-write; you will have to reboot the server.
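A quick way to spot a filesystem that has gone read-only is to parse the mount table. A sketch, using the /proc/mounts field layout (device, mountpoint, fstype, options, dump, pass):

```shell
# Sketch: list filesystems currently mounted read-only, from /proc/mounts.
# Field layout: device mountpoint fstype options dump pass

ro_mounts() {
    # Prints mountpoints whose option list contains "ro" as a whole option.
    awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }'
}

[ -r /proc/mounts ] && ro_mounts < /proc/mounts
```

This only reports the current state; it will not tell you about a read-only remount that was already cleared by a reboot.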

 

Just a guess.

HTH,

Gianluca

 


_______________________________________________
rhelv6-list mailing list
rhelv6-list redhat com
https://www.redhat.com/mailman/listinfo/rhelv6-list

 



