From fpicabia at gmail.com  Wed Dec 20 20:27:42 2017
From: fpicabia at gmail.com (francis picabia)
Date: Wed, 20 Dec 2017 16:27:42 -0400
Subject: [rhelv6-list] fsck -n always showing errors
Message-ID: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>

The file systems are typically ext4.  Running current patches on Redhat 6.9.

This isn't something we routinely look at, but after
a couple of VMware systems showing scsi errors, I noticed almost
every Redhat 6 system will show some disk errors from
something like fsck -n / or same on /var

# fsck -n /
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Warning!  /dev/sda1 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem
check.
/dev/sda1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 22413413 has zero dtime.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -46708735 -(46712832--46712840) +79201706
+79201715 -79201836 -79202321 -(79234801--79234855) +(79234914--79234968)
Fix? no

Free blocks count wrong (42767651, counted=42765254).
Fix? no

Inode bitmap differences:  -22413413
Fix? no

Free inodes count wrong (25523840, counted=25523326).
Fix? no


/dev/sda1: ********** WARNING: Filesystem still has errors **********

/dev/sda1: 526720/26050560 files (0.2% non-contiguous), 61424349/104192000
blocks


The SAN backend for these VMs, nor the host boxes, have no alerts or
warnings.

With one development box I did touch /forcefsck and rebooted.
Retested fsck and still issues.  Repeated this cycle 3 times
and no improvement.

We have seen errors in the file system flagged like this on
physical systems and VMs.  They are not impacting
performance or booting, but given that two showed scsi
errors, I would think there is some possibility of data corruption.

This is too widespread to be any particular hardware system at fault.

I can't tell if there are bugs in ext4 or fsck is giving false positives.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171220/4c9c5ed8/attachment.htm>

From gianluca.cecchi at gmail.com  Wed Dec 20 21:57:26 2017
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Wed, 20 Dec 2017 22:57:26 +0100
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
Message-ID: <CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>

On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia at gmail.com> wrote:

>
> With one development box I did touch /forcefsck and rebooted.
> Retested fsck and still issues.  Repeated this cycle 3 times
> and no improvement.
>

Hi,
not going into the reasons of the problem, but into your "cycle".
if I have understood correctly your sentence, you run fsck and use "-n"
option that automatically answers "no" to all the questions related to
problems and suggestions to fix them.
So, as you didn't fix anything, the next run the fsck command exposes the
same problems again....

Sometimes I have seen in vSphere environments storage problems causing
linux VMs problems and so kernel to automatically put one or more
filesystems in read-only mode: typically the filesystems where there were
writes in action during the problem occurrence.
So in your case it could be something similar with impact to all the VMs
insisting on the problematic storage / datastore
If you have no monitoring in place, such as Nagios and a monitor like this:
https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details
you can go ahead also some days before realizing that you had a problem
Analyzing /var/log/messages you should see when it happened

Take in mind that if the filesystem went in read-only mode due to a SCSI
error (action taken by the kernel to prevent further errors and data
corruption), you will not be able to remount it read-write, but you have to
reboot the server.

Just a guess.
HIH,
Gianluca
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171220/5f164718/attachment.htm>

From solarflow99 at gmail.com  Wed Dec 20 22:08:31 2017
From: solarflow99 at gmail.com (solarflow99)
Date: Wed, 20 Dec 2017 14:08:31 -0800
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
Message-ID: <CAO8i5OL4AFVvEqSWZEGCvFGJF7B3mVfai-ZkkyYa-525DW-Zfg@mail.gmail.com>

thanks for the replies, nice to see some of us are still using the list.  I
sure wish they just gated email <-> forum instead of eliminating it.


On Wed, Dec 20, 2017 at 1:57 PM, Gianluca Cecchi <gianluca.cecchi at gmail.com>
wrote:

>
> On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia at gmail.com>
> wrote:
>
>>
>> With one development box I did touch /forcefsck and rebooted.
>> Retested fsck and still issues.  Repeated this cycle 3 times
>> and no improvement.
>>
>
> Hi,
> not going into the reasons of the problem, but into your "cycle".
> if I have understood correctly your sentence, you run fsck and use "-n"
> option that automatically answers "no" to all the questions related to
> problems and suggestions to fix them.
> So, as you didn't fix anything, the next run the fsck command exposes the
> same problems again....
>
> Sometimes I have seen in vSphere environments storage problems causing
> linux VMs problems and so kernel to automatically put one or more
> filesystems in read-only mode: typically the filesystems where there were
> writes in action during the problem occurrence.
> So in your case it could be something similar with impact to all the VMs
> insisting on the problematic storage / datastore
> If you have no monitoring in place, such as Nagios and a monitor like this:
> https://exchange.nagios.org/directory/Plugins/Operating-
> Systems/Linux/check_ro_mounts/details
> you can go ahead also some days before realizing that you had a problem
> Analyzing /var/log/messages you should see when it happened
>
> Take in mind that if the filesystem went in read-only mode due to a SCSI
> error (action taken by the kernel to prevent further errors and data
> corruption), you will not be able to remount it read-write, but you have to
> reboot the server.
>
> Just a guess.
> HIH,
> Gianluca
>
>
> _______________________________________________
> rhelv6-list mailing list
> rhelv6-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rhelv6-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171220/b908e8e4/attachment.htm>

From fpicabia at gmail.com  Thu Dec 21 12:20:56 2017
From: fpicabia at gmail.com (francis picabia)
Date: Thu, 21 Dec 2017 08:20:56 -0400
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
Message-ID: <CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>

fsck -n is used to verify only.

The touch on /forcefsck will force a regular fsck on unmounted
partitions on boot up.

So what I've done is:

fsck -n
touch /forcefsck
reboot

times three.

It should be actually fixing the problems on reboot.

I can find there are at least some fsck errors on every Redhat 6 machine,
whether virtual or physical.  I mean I've tested the fsck -n status on about
twelve systems which have some errors.  Only 2 showed a history
of SCSI errors, both happening to be VMware.

Maybe some other people can test this on their Redhat 6 systems
and see if fsck -n /var or similar comes back clean.  You might
be surprised to see the same state I've noticed.  There is
no issue like read-only file system.   Everything is functional.


On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi <gianluca.cecchi at gmail.com>
wrote:

>
> On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia at gmail.com>
> wrote:
>
>>
>> With one development box I did touch /forcefsck and rebooted.
>> Retested fsck and still issues.  Repeated this cycle 3 times
>> and no improvement.
>>
>
> Hi,
> not going into the reasons of the problem, but into your "cycle".
> if I have understood correctly your sentence, you run fsck and use "-n"
> option that automatically answers "no" to all the questions related to
> problems and suggestions to fix them.
> So, as you didn't fix anything, the next run the fsck command exposes the
> same problems again....
>
> Sometimes I have seen in vSphere environments storage problems causing
> linux VMs problems and so kernel to automatically put one or more
> filesystems in read-only mode: typically the filesystems where there were
> writes in action during the problem occurrence.
> So in your case it could be something similar with impact to all the VMs
> insisting on the problematic storage / datastore
> If you have no monitoring in place, such as Nagios and a monitor like this:
> https://exchange.nagios.org/directory/Plugins/Operating-
> Systems/Linux/check_ro_mounts/details
> you can go ahead also some days before realizing that you had a problem
> Analyzing /var/log/messages you should see when it happened
>
> Take in mind that if the filesystem went in read-only mode due to a SCSI
> error (action taken by the kernel to prevent further errors and data
> corruption), you will not be able to remount it read-write, but you have to
> reboot the server.
>
> Just a guess.
> HIH,
> Gianluca
>
>
> _______________________________________________
> rhelv6-list mailing list
> rhelv6-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rhelv6-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171221/651923b2/attachment.htm>

From cfairchild at brocku.ca  Thu Dec 21 13:09:03 2017
From: cfairchild at brocku.ca (Cale Fairchild)
Date: Thu, 21 Dec 2017 13:09:03 +0000
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
	<CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>
Message-ID: <DD4C919E4D618E428F06AA127B43857001F6DBEB20@MAILSTORE3.campus.brocku.local>

Have you checked the filesystem from a rescue disk or does the fsck on reboot report that it is fixing errors each time? As far as I understand running `fsck -n /` on the active root filesystem will most always return some errors as the blocks in the filesystem are changing while the fsck is running it?s passes. Thus the warning at the beginning of the process about the filesystem being mounted. Sorry if I am misunderstanding your process, but if you have not tried checking the filesystem after booting into rescue mode that would be a good step.

From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia
Sent: December 21, 2017 07:21
To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <rhelv6-list at redhat.com>
Subject: Re: [rhelv6-list] fsck -n always showing errors

fsck -n is used to verify only.
The touch on /forcefsck will force a regular fsck on unmounted
partitions on boot up.
So what I've done is:
fsck -n
touch /forcefsck
reboot
times three.
It should be actually fixing the problems on reboot.
I can find there are at least some fsck errors on every Redhat 6 machine,
whether virtual or physical.  I mean I've tested the fsck -n status on about
twelve systems which have some errors.  Only 2 showed a history
of SCSI errors, both happening to be VMware.
Maybe some other people can test this on their Redhat 6 systems
and see if fsck -n /var or similar comes back clean.  You might
be surprised to see the same state I've noticed.  There is
no issue like read-only file system.   Everything is functional.


On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi <gianluca.cecchi at gmail.com<mailto:gianluca.cecchi at gmail.com>> wrote:

On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia at gmail.com<mailto:fpicabia at gmail.com>> wrote:

With one development box I did touch /forcefsck and rebooted.
Retested fsck and still issues.  Repeated this cycle 3 times
and no improvement.

Hi,
not going into the reasons of the problem, but into your "cycle".
if I have understood correctly your sentence, you run fsck and use "-n" option that automatically answers "no" to all the questions related to problems and suggestions to fix them.
So, as you didn't fix anything, the next run the fsck command exposes the same problems again....

Sometimes I have seen in vSphere environments storage problems causing linux VMs problems and so kernel to automatically put one or more filesystems in read-only mode: typically the filesystems where there were writes in action during the problem occurrence.
So in your case it could be something similar with impact to all the VMs insisting on the problematic storage / datastore
If you have no monitoring in place, such as Nagios and a monitor like this:
https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details
you can go ahead also some days before realizing that you had a problem
Analyzing /var/log/messages you should see when it happened

Take in mind that if the filesystem went in read-only mode due to a SCSI error (action taken by the kernel to prevent further errors and data corruption), you will not be able to remount it read-write, but you have to reboot the server.

Just a guess.
HIH,
Gianluca


_______________________________________________
rhelv6-list mailing list
rhelv6-list at redhat.com<mailto:rhelv6-list at redhat.com>
https://www.redhat.com/mailman/listinfo/rhelv6-list

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171221/d11dcb44/attachment.htm>

From fpicabia at gmail.com  Thu Dec 21 15:46:59 2017
From: fpicabia at gmail.com (francis picabia)
Date: Thu, 21 Dec 2017 11:46:59 -0400
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <DD4C919E4D618E428F06AA127B43857001F6DBEB20@MAILSTORE3.campus.brocku.local>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
	<CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>
	<DD4C919E4D618E428F06AA127B43857001F6DBEB20@MAILSTORE3.campus.brocku.local>
Message-ID: <CA+AKB6EcFObk9g0TTt0Aa3hmfdXRNgN-0N5+ZOJ9xPTtkeA+6Q@mail.gmail.com>

Thanks for the replies...

OK, I was expecting there must be some sort of false positive going on.
For the system I listed here, those are not persistent errors.

However there is one which does show the same orphaned inode numbers
on each run, so this is likely real.

# fsck -n /var
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Warning!  /dev/sda2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem
check.
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 1059654 has zero dtime.  Fix? no

Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 1061014 was part of the orphaned inode list.  IGNORED.
Inode 1061275 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -124293 -130887 -4244999 -4285460 -4979711
-4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657
-7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336
Fix? no

Free blocks count wrong (6918236, counted=5214069).
Fix? no

Inode bitmap differences:  -1059654 -1061014 -1061275
Fix? no

Free inodes count wrong (1966010, counted=1878618).
Fix? no


/dev/sda2: ********** WARNING: Filesystem still has errors **********

/dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000
blocks

dmesg shows it had some scsi issues.  I suspect the scsi error
is triggered by operation of VDP backup, which freezes the system
for a second when completing the backup snapshot.

sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
 ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
 ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
 ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8
Call Trace:
 [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
 [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
 [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff810a649e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a6400>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
INFO: task master:1778 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master        D 0000000000000000     0  1778      1 0x00000080
 ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
 ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8
Call Trace:
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7102>] file_update_time+0xf2/0x170
 [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
 [<ffffffff81199c2a>] do_sync_write+0xfa/0x140
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8119f964>] ? cp_new_stat+0xe4/0x100
 [<ffffffff81014b39>] ? read_tsc+0x9/0x20
 [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff81199f28>] vfs_write+0xb8/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119aa61>] sys_write+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task pickup:1236 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pickup        D 0000000000000001     0  1236   1778 0x00000080
 ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
 ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
 ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8
Call Trace:
 [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
 [<ffffffff811d1440>] sync_buffer+0x40/0x50
 [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
 [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
 [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
 [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
 [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
 [<ffffffff811b7315>] touch_atime+0x195/0x1a0
 [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
 [<ffffffff81199d6a>] do_sync_read+0xfa/0x140
 [<ffffffff811e2e80>] ? ep_send_events_proc+0x0/0x110
 [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff8119a665>] vfs_read+0xb5/0x1a0
 [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
 [<ffffffff8119a9b1>] sys_read+0x51/0xb0
 [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
sd 2:0:0:0: [sda] SCSI device reset on scsi2:0

If I just repair systems with that in their runtime history I should be on
target
for any concerns.

Thanks for the responses...


On Thu, Dec 21, 2017 at 9:09 AM, Cale Fairchild <cfairchild at brocku.ca>
wrote:

> Have you checked the filesystem from a rescue disk or does the fsck on
> reboot report that it is fixing errors each time? As far as I understand
> running `fsck -n /` on the active root filesystem will most always return
> some errors as the blocks in the filesystem are changing while the fsck is
> running it?s passes. Thus the warning at the beginning of the process about
> the filesystem being mounted. Sorry if I am misunderstanding your process,
> but if you have not tried checking the filesystem after booting into rescue
> mode that would be a good step.
>
>
>
> *From:* rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces@
> redhat.com] *On Behalf Of *francis picabia
> *Sent:* December 21, 2017 07:21
> *To:* Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <
> rhelv6-list at redhat.com>
> *Subject:* Re: [rhelv6-list] fsck -n always showing errors
>
>
>
> fsck -n is used to verify only.
>
> The touch on /forcefsck will force a regular fsck on unmounted
>
> partitions on boot up.
>
> So what I've done is:
>
> fsck -n
>
> touch /forcefsck
>
> reboot
>
> times three.
>
> It should be actually fixing the problems on reboot.
>
> I can find there are at least some fsck errors on every Redhat 6 machine,
>
> whether virtual or physical.  I mean I've tested the fsck -n status on
> about
>
> twelve systems which have some errors.  Only 2 showed a history
>
> of SCSI errors, both happening to be VMware.
>
> Maybe some other people can test this on their Redhat 6 systems
>
> and see if fsck -n /var or similar comes back clean.  You might
>
> be surprised to see the same state I've noticed.  There is
>
> no issue like read-only file system.   Everything is functional.
>
>
>
>
>
> On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi <
> gianluca.cecchi at gmail.com> wrote:
>
>
>
> On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia at gmail.com>
> wrote:
>
>
>
> With one development box I did touch /forcefsck and rebooted.
>
> Retested fsck and still issues.  Repeated this cycle 3 times
>
> and no improvement.
>
>
>
> Hi,
>
> not going into the reasons of the problem, but into your "cycle".
>
> if I have understood correctly your sentence, you run fsck and use "-n"
> option that automatically answers "no" to all the questions related to
> problems and suggestions to fix them.
>
> So, as you didn't fix anything, the next run the fsck command exposes the
> same problems again....
>
>
>
> Sometimes I have seen in vSphere environments storage problems causing
> linux VMs problems and so kernel to automatically put one or more
> filesystems in read-only mode: typically the filesystems where there were
> writes in action during the problem occurrence.
>
> So in your case it could be something similar with impact to all the VMs
> insisting on the problematic storage / datastore
>
> If you have no monitoring in place, such as Nagios and a monitor like this:
>
> https://exchange.nagios.org/directory/Plugins/Operating-
> Systems/Linux/check_ro_mounts/details
>
> you can go ahead also some days before realizing that you had a problem
>
> Analyzing /var/log/messages you should see when it happened
>
>
>
> Take in mind that if the filesystem went in read-only mode due to a SCSI
> error (action taken by the kernel to prevent further errors and data
> corruption), you will not be able to remount it read-write, but you have to
> reboot the server.
>
>
>
> Just a guess.
>
> HIH,
>
> Gianluca
>
>
>
>
> _______________________________________________
> rhelv6-list mailing list
> rhelv6-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rhelv6-list
>
>
>
> _______________________________________________
> rhelv6-list mailing list
> rhelv6-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rhelv6-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171221/62a73536/attachment.htm>

From solarflow99 at gmail.com  Thu Dec 21 17:06:56 2017
From: solarflow99 at gmail.com (solarflow99)
Date: Thu, 21 Dec 2017 09:06:56 -0800
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <CA+AKB6EcFObk9g0TTt0Aa3hmfdXRNgN-0N5+ZOJ9xPTtkeA+6Q@mail.gmail.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
	<CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>
	<DD4C919E4D618E428F06AA127B43857001F6DBEB20@MAILSTORE3.campus.brocku.local>
	<CA+AKB6EcFObk9g0TTt0Aa3hmfdXRNgN-0N5+ZOJ9xPTtkeA+6Q@mail.gmail.com>
Message-ID: <CAO8i5O+5oDDpV2Fa0TRqOcwU0NBS3yPS2cgPXz1cLT2sBi4WSw@mail.gmail.com>

I'd just do a rescue, doesn't even need to be EL-6, and do the fsck in rw
mode


On Dec 21, 2017 7:47 AM, "francis picabia" <fpicabia at gmail.com> wrote:

> Thanks for the replies...
>
> OK, I was expecting there must be some sort of false positive going on.
> For the system I listed here, those are not persistent errors.
>
> However there is one which does show the same orphaned inode numbers
> on each run, so this is likely real.
>
> # fsck -n /var
> fsck from util-linux-ng 2.17.2
> e2fsck 1.41.12 (17-May-2010)
> Warning!  /dev/sda2 is mounted.
> Warning: skipping journal recovery because doing a read-only filesystem
> check.
> /dev/sda2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 1059654 has zero dtime.  Fix? no
>
> Inodes that were part of a corrupted orphan linked list found.  Fix? no
>
> Inode 1061014 was part of the orphaned inode list.  IGNORED.
> Inode 1061275 was part of the orphaned inode list.  IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences:  -124293 -130887 -4244999 -4285460 -4979711
> -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657
> -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336
> Fix? no
>
> Free blocks count wrong (6918236, counted=5214069).
> Fix? no
>
> Inode bitmap differences:  -1059654 -1061014 -1061275
> Fix? no
>
> Free inodes count wrong (1966010, counted=1878618).
> Fix? no
>
>
> /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000
> blocks
>
> dmesg shows it had some scsi issues.  I suspect the scsi error
> is triggered by operation of VDP backup, which freezes the system
> for a second when completing the backup snapshot.
>
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
> INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
>  ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
>  ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
>  ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8
> Call Trace:
>  [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750
>  [<ffffffff81014b39>] ? read_tsc+0x9/0x20
>  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
>  [<ffffffff811d1440>] sync_buffer+0x40/0x50
>  [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
>  [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
>  [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
>  [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
>  [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]
>  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]
>  [<ffffffff810a649e>] kthread+0x9e/0xc0
>  [<ffffffff8100c28a>] child_rip+0xa/0x20
>  [<ffffffff810a6400>] ? kthread+0x0/0xc0
>  [<ffffffff8100c280>] ? child_rip+0x0/0x20
> INFO: task master:1778 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> master        D 0000000000000000     0  1778      1 0x00000080
>  ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
>  00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
>  ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8
> Call Trace:
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
>  [<ffffffff811d1440>] sync_buffer+0x40/0x50
>  [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
>  [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
>  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
>  [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
>  [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
>  [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
>  [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
>  [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
>  [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
>  [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
>  [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
>  [<ffffffff811b7102>] file_update_time+0xf2/0x170
>  [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
>  [<ffffffff81199c2a>] do_sync_write+0xfa/0x140
>  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff8119f964>] ? cp_new_stat+0xe4/0x100
>  [<ffffffff81014b39>] ? read_tsc+0x9/0x20
>  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
>  [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
>  [<ffffffff81199f28>] vfs_write+0xb8/0x1a0
>  [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
>  [<ffffffff8119aa61>] sys_write+0x51/0xb0
>  [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
>  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> INFO: task pickup:1236 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> pickup        D 0000000000000001     0  1236   1778 0x00000080
>  ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
>  ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
>  ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8
> Call Trace:
>  [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
>  [<ffffffff811d1440>] sync_buffer+0x40/0x50
>  [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
>  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50
>  [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200
>  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
>  [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
>  [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
>  [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
>  [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
>  [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
>  [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
>  [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
>  [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0
>  [<ffffffff811b7315>] touch_atime+0x195/0x1a0
>  [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
>  [<ffffffff81199d6a>] do_sync_read+0xfa/0x140
>  [<ffffffff811e2e80>] ? ep_send_events_proc+0x0/0x110
>  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
>  [<ffffffff8119a665>] vfs_read+0xb5/0x1a0
>  [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50
>  [<ffffffff8119a9b1>] sys_read+0x51/0xb0
>  [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
>  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
> sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
> sd 2:0:0:0: [sda] SCSI device reset on scsi2:0
>
> If I just repair systems with that in their runtime history I should be on
> target
> for any concerns.
>
> Thanks for the responses...
>
>
> On Thu, Dec 21, 2017 at 9:09 AM, Cale Fairchild <cfairchild at brocku.ca>
> wrote:
>
>> Have you checked the filesystem from a rescue disk or does the fsck on
>> reboot report that it is fixing errors each time? As far as I understand
>> running `fsck -n /` on the active root filesystem will most always return
>> some errors as the blocks in the filesystem are changing while the fsck is
>> running it?s passes. Thus the warning at the beginning of the process about
>> the filesystem being mounted. Sorry if I am misunderstanding your process,
>> but if you have not tried checking the filesystem after booting into rescue
>> mode that would be a good step.
>>
>>
>>
>> *From:* rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at re
>> dhat.com] *On Behalf Of *francis picabia
>> *Sent:* December 21, 2017 07:21
>> *To:* Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <
>> rhelv6-list at redhat.com>
>> *Subject:* Re: [rhelv6-list] fsck -n always showing errors
>>
>>
>>
>> fsck -n is used to verify only.
>>
>> The touch on /forcefsck will force a regular fsck on unmounted
>>
>> partitions on boot up.
>>
>> So what I've done is:
>>
>> fsck -n
>>
>> touch /forcefsck
>>
>> reboot
>>
>> times three.
>>
>> It should be actually fixing the problems on reboot.
>>
>> I can find there are at least some fsck errors on every Redhat 6 machine,
>>
>> whether virtual or physical.  I mean I've tested the fsck -n status on
>> about
>>
>> twelve systems which have some errors.  Only 2 showed a history
>>
>> of SCSI errors, both happening to be VMware.
>>
>> Maybe some other people can test this on their Redhat 6 systems
>>
>> and see if fsck -n /var or similar comes back clean.  You might
>>
>> be surprised to see the same state I've noticed.  There is
>>
>> no issue like read-only file system.   Everything is functional.
>>
>>
>>
>>
>>
>> On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi <
>> gianluca.cecchi at gmail.com> wrote:
>>
>>
>>
>> On Wed, Dec 20, 2017 at 9:27 PM, francis picabia <fpicabia at gmail.com>
>> wrote:
>>
>>
>>
>> With one development box I did touch /forcefsck and rebooted.
>>
>> Retested fsck and still issues.  Repeated this cycle 3 times
>>
>> and no improvement.
>>
>>
>>
>> Hi,
>>
>> not going into the reasons of the problem, but into your "cycle".
>>
>> if I have understood correctly your sentence, you run fsck and use "-n"
>> option that automatically answers "no" to all the questions related to
>> problems and suggestions to fix them.
>>
>> So, as you didn't fix anything, the next run the fsck command exposes the
>> same problems again....
>>
>>
>>
>> Sometimes I have seen in vSphere environments storage problems causing
>> linux VMs problems and so kernel to automatically put one or more
>> filesystems in read-only mode: typically the filesystems where there were
>> writes in action during the problem occurrence.
>>
>> So in your case it could be something similar with impact to all the VMs
>> insisting on the problematic storage / datastore
>>
>> If you have no monitoring in place, such as Nagios and a monitor like
>> this:
>>
>> https://exchange.nagios.org/directory/Plugins/Operating-Syst
>> ems/Linux/check_ro_mounts/details
>>
>> you can go ahead also some days before realizing that you had a problem
>>
>> Analyzing /var/log/messages you should see when it happened
>>
>>
>>
>> Take in mind that if the filesystem went in read-only mode due to a SCSI
>> error (action taken by the kernel to prevent further errors and data
>> corruption), you will not be able to remount it read-write, but you have to
>> reboot the server.
>>
>>
>>
>> Just a guess.
>>
>> HIH,
>>
>> Gianluca
>>
>>
>>
>>
>> _______________________________________________
>> rhelv6-list mailing list
>> rhelv6-list at redhat.com
>> https://www.redhat.com/mailman/listinfo/rhelv6-list
>>
>>
>>
>> _______________________________________________
>> rhelv6-list mailing list
>> rhelv6-list at redhat.com
>> https://www.redhat.com/mailman/listinfo/rhelv6-list
>>
>
>
> _______________________________________________
> rhelv6-list mailing list
> rhelv6-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rhelv6-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171221/9e8bc312/attachment.htm>

From Tim.Mooney at ndsu.edu  Thu Dec 21 18:48:23 2017
From: Tim.Mooney at ndsu.edu (Tim Mooney)
Date: Thu, 21 Dec 2017 12:48:23 -0600 (CST)
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <mailman.37920.1513858862.9262.rhelv6-list@redhat.com>
References: <mailman.37920.1513858862.9262.rhelv6-list@redhat.com>
Message-ID: <alpine.SOC.2.20.1712211239150.1205@dogbert.cc.ndsu.NoDak.edu>

In regard to: rhelv6-list Digest, Vol 76, Issue 1, rhelv6-list-request at redh...:

> This isn't something we routinely look at, but after
> a couple of VMware systems showing scsi errors, I noticed almost
> every Redhat 6 system will show some disk errors from
> something like fsck -n / or same on /var
>
> # fsck -n /
> fsck from util-linux-ng 2.17.2
> e2fsck 1.41.12 (17-May-2010)
> Warning!  /dev/sda1 is mounted.

There's your problem.  Don't run fsck on a mounted filesystem.  Even
with -n, it just shows you false positives.

Do some web searching for

 	fsck on a mounted filesystem

to understand why.

Tim
-- 
Tim Mooney                                             Tim.Mooney at ndsu.edu
Enterprise Computing & Infrastructure                  701-231-1076 (Voice)
Room 242-J6, Quentin Burdick Building                  701-231-8541 (Fax)
North Dakota State University, Fargo, ND 58105-5164



From fpicabia at gmail.com  Fri Dec 22 15:05:04 2017
From: fpicabia at gmail.com (francis picabia)
Date: Fri, 22 Dec 2017 11:05:04 -0400
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <alpine.SOC.2.20.1712211239150.1205@dogbert.cc.ndsu.NoDak.edu>
References: <mailman.37920.1513858862.9262.rhelv6-list@redhat.com>
	<alpine.SOC.2.20.1712211239150.1205@dogbert.cc.ndsu.NoDak.edu>
Message-ID: <CA+AKB6HQeHDOC6E7u2Gwd5zajHU2sCVcGPbUboy4pLJb4DmQAA@mail.gmail.com>

On Thu, Dec 21, 2017 at 2:48 PM, Tim Mooney <Tim.Mooney at ndsu.edu> wrote:

> In regard to: rhelv6-list Digest, Vol 76, Issue 1,
> rhelv6-list-request at redh...:
>
> This isn't something we routinely look at, but after
>> a couple of VMware systems showing scsi errors, I noticed almost
>> every Redhat 6 system will show some disk errors from
>> something like fsck -n / or same on /var
>>
>> # fsck -n /
>> fsck from util-linux-ng 2.17.2
>> e2fsck 1.41.12 (17-May-2010)
>> Warning!  /dev/sda1 is mounted.
>>
>
> There's your problem.  Don't run fsck on a mounted filesystem.  Even
> with -n, it just shows you false positives.
>
> Do some web searching for
>
>         fsck on a mounted filesystem
>
> to understand why.
>
>

Well, I think they make the -n/-N flag in fsck for some purpose other than
don't do it.
It is designed to be run on a system to check it without modifying.

My conclusion is it is only useful for seeing an error such as orphaned
inodes which are persistent across multiple runs of fsck -n

If there are other checksums and such that don't seem correct, that would
be expected
on a live filesystem.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171222/b8136b5c/attachment.htm>

From KCollins at chevron.com  Fri Dec 22 15:42:26 2017
From: KCollins at chevron.com (Collins, Kevin)
Date: Fri, 22 Dec 2017 15:42:26 +0000
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <CA+AKB6HQeHDOC6E7u2Gwd5zajHU2sCVcGPbUboy4pLJb4DmQAA@mail.gmail.com>
References: <mailman.37920.1513858862.9262.rhelv6-list@redhat.com>
	<alpine.SOC.2.20.1712211239150.1205@dogbert.cc.ndsu.NoDak.edu>
	<CA+AKB6HQeHDOC6E7u2Gwd5zajHU2sCVcGPbUboy4pLJb4DmQAA@mail.gmail.com>
Message-ID: <6F56410FBED1FC41BCA804E16F594B0B78C94B76@san520w8xmbx05.gdc0.chevron.net>

From ?man fsck?:

       -N     Don't execute, just show what would be done.

and:

       Options  to  different  filesystem-specific  fsck's  are not standardized.  If in doubt, please consult the man
       pages of the filesystem-specific checker.  Although not guaranteed, the following options are supported by most
       file system checkers:

?
?

       -n     For some filesystem-specific checkers, the -n option will cause the fs-specific fsck to avoid attempting
              to repair any problems, but simply report such problems to stdout.  This is however  not  true  for  all
              filesystem-specific  checkers.   In particular, fsck.reiserfs(8) will not report any corruption if given
              this option.  fsck.minix(8) does not support the -n option at all.

From ?man e2fsck?:

       -n     Open  the filesystem read-only, and assume an answer of 'no' to all questions.  Allows e2fsck to be used
              non-interactively.  This option may not be specified at the same time as the -p or -y options.

Kevin

From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia
Sent: Friday, December 22, 2017 7:05 AM
To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <rhelv6-list at redhat.com>
Subject: [**EXTERNAL**] Re: [rhelv6-list] fsck -n always showing errors


On Thu, Dec 21, 2017 at 2:48 PM, Tim Mooney <Tim.Mooney at ndsu.edu<mailto:Tim.Mooney at ndsu.edu>> wrote:
In regard to: rhelv6-list Digest, Vol 76, Issue 1, rhelv6-list-request at redh...:
This isn't something we routinely look at, but after
a couple of VMware systems showing scsi errors, I noticed almost
every Redhat 6 system will show some disk errors from
something like fsck -n / or same on /var

# fsck -n /
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Warning!  /dev/sda1 is mounted.

There's your problem.  Don't run fsck on a mounted filesystem.  Even
with -n, it just shows you false positives.

Do some web searching for

        fsck on a mounted filesystem

to understand why.

Well, I think they make the -n/-N flag in fsck for some purpose other than don't do it.
It is designed to be run on a system to check it without modifying.
My conclusion is it is only useful for seeing an error such as orphaned
inodes which are persistent across multiple runs of fsck -n
If there are other checksums and such that don't seem correct, that would be expected
on a live filesystem.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171222/85dcd3ba/attachment.htm>

From hugh-brown at uiowa.edu  Fri Dec 22 16:29:27 2017
From: hugh-brown at uiowa.edu (Brown, Hugh M)
Date: Fri, 22 Dec 2017 16:29:27 +0000
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <CA+AKB6EcFObk9g0TTt0Aa3hmfdXRNgN-0N5+ZOJ9xPTtkeA+6Q@mail.gmail.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
	<CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>
	<DD4C919E4D618E428F06AA127B43857001F6DBEB20@MAILSTORE3.campus.brocku.local>
	<CA+AKB6EcFObk9g0TTt0Aa3hmfdXRNgN-0N5+ZOJ9xPTtkeA+6Q@mail.gmail.com>
Message-ID: <DM5PR04MB37712D52400FEF9A4EE5A71A9C020@DM5PR04MB3771.namprd04.prod.outlook.com>

Response at bottom

-----Original Message-----
From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia
Sent: Thursday, December 21, 2017 9:47 AM
To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <rhelv6-list at redhat.com>
Subject: Re: [rhelv6-list] fsck -n always showing errors

Thanks for the replies...


OK, I was expecting there must be some sort of false positive going on.

For the system I listed here, those are not persistent errors.


However there is one which does show the same orphaned inode numbers

on each run, so this is likely real.

# fsck -n /var
fsck from util-linux-ng 2.17.2
e2fsck 1.41.12 (17-May-2010)
Warning!  /dev/sda2 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/sda2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes Deleted inode 1059654 has zero dtime.  Fix? no

Inodes that were part of a corrupted orphan linked list found.  Fix? no

Inode 1061014 was part of the orphaned inode list.  IGNORED.
Inode 1061275 was part of the orphaned inode list.  IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information Block bitmap differences:  -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no

Free blocks count wrong (6918236, counted=5214069).
Fix? no

Inode bitmap differences:  -1059654 -1061014 -1061275 Fix? no

Free inodes count wrong (1966010, counted=1878618).
Fix? no


/dev/sda2: ********** WARNING: Filesystem still has errors **********

/dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 blocks


dmesg shows it had some scsi issues.  I suspect the scsi error

is triggered by operation of VDP backup, which freezes the system

for a second when completing the backup snapshot.

sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
 ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
 ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
 ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 Call Trace:
 [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750  [<ffffffff81014b39>] ? read_tsc+0x9/0x20  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50  [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0  [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30  [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]  [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0  [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40  [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]  [<ffffffff810a649e>] kthread+0x9e/0xc0  [<ffffffff8100c28a>] child_rip+0xa/0x20  [<ffffffff810a6400>] ? kthread+0x0/0xc0  [<ffffffff8100c280>] ? child_rip+0x0/0x20
INFO: task master:1778 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
master        D 0000000000000000     0  1778      1 0x00000080
 ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
 ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 Call Trace:
 [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50  [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40  [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]  [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]  [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]  [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]  [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]  [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]  [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]  [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7102>] file_update_time+0xf2/0x170  [<ffffffff811a4f02>] pipe_write+0x312/0x6b0  [<ffffffff81199c2a>] do_sync_write+0xfa/0x140  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40  [<ffffffff8119f964>] ? cp_new_stat+0xe4/0x100  [<ffffffff81014b39>] ? read_tsc+0x9/0x20  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100  [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff81199f28>] vfs_write+0xb8/0x1a0  [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50  [<ffffffff8119aa61>] sys_write+0x51/0xb0  [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
INFO: task pickup:1236 blocked for more than 120 seconds.
      Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
pickup        D 0000000000000001     0  1236   1778 0x00000080
 ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
 ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
 ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 Call Trace:
 [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50  [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ? __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40  [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]  [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]  [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]  [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]  [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]  [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]  [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]  [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7315>] touch_atime+0x195/0x1a0  [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0  [<ffffffff81199d6a>] do_sync_read+0xfa/0x140  [<ffffffff811e2e80>] ? ep_send_events_proc+0x0/0x110  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40  [<ffffffff8123ae06>] ? security_file_permission+0x16/0x20
 [<ffffffff8119a665>] vfs_read+0xb5/0x1a0  [<ffffffff8119b416>] ? fget_light_pos+0x16/0x50  [<ffffffff8119a9b1>] sys_read+0x51/0xb0  [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290  [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680 sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680 sd 2:0:0:0: [sda] SCSI device reset on scsi2:0


If I just repair systems with that in their runtime history I should be on target for any concerns.


Thanks for the responses...




I've never really had fsck fail to correct errors when run manually. I have had the touch /forcefsck && reboot option decide that a fix was too risky and refuse to do it. The manual run would then fix it. Typically booting single user mode was enough to sort it out. If the problem disk was the root fs, then rescue media was the solution.

We did have an iscsi array reboot which caused the filesystem to go read-only and at the time, we ran fsck -n to check for any errors. We did get a few errors of the type that you'd expect from a filesystem that is mounted, but not any inode or bitmap errors.

We also had a hyper-v vm get in a wedged state because the backup mechanism called the filesystem freeze (fsfreeze) and then the backup software crashed and never unfroze the filesystem. We had to update the backup software and the hyper-v drivers for that.

The only time I couldn't get fsck to behave was when a couple of systems had faulty RAM. In those cases the filesystem corruption was severe and it was easier to replace memory and reimage/restore from backups.

So, I don't think fsck is showing false positives. You should be able to clear the errors with a manual fsck and I would definitely be concerned that a number of systems were showing fs errors.

If you can't get the manual fsck to fix all of the errors, it might be worth opening a support ticket with RedHat.

Hugh





From bsawyers at vt.edu  Fri Dec 22 16:30:52 2017
From: bsawyers at vt.edu (Brandon Sawyers)
Date: Fri, 22 Dec 2017 16:30:52 +0000
Subject: [rhelv6-list] Unsubscribe
In-Reply-To: <DM5PR04MB37712D52400FEF9A4EE5A71A9C020@DM5PR04MB3771.namprd04.prod.outlook.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
	<CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>
	<DD4C919E4D618E428F06AA127B43857001F6DBEB20@MAILSTORE3.campus.brocku.local>
	<CA+AKB6EcFObk9g0TTt0Aa3hmfdXRNgN-0N5+ZOJ9xPTtkeA+6Q@mail.gmail.com>
	<DM5PR04MB37712D52400FEF9A4EE5A71A9C020@DM5PR04MB3771.namprd04.prod.outlook.com>
Message-ID: <CAFMLoia=kY6Dj_Y6Nw1hjw+SZ9vvdQw1rkB2UCu4Vd-mNcthOg@mail.gmail.com>

On Fri, Dec 22, 2017, 11:29 Brown, Hugh M <hugh-brown at uiowa.edu> wrote:

> Response at bottom
>
> -----Original Message-----
> From: rhelv6-list-bounces at redhat.com [mailto:
> rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia
> Sent: Thursday, December 21, 2017 9:47 AM
> To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <
> rhelv6-list at redhat.com>
> Subject: Re: [rhelv6-list] fsck -n always showing errors
>
> Thanks for the replies...
>
>
> OK, I was expecting there must be some sort of false positive going on.
>
> For the system I listed here, those are not persistent errors.
>
>
> However there is one which does show the same orphaned inode numbers
>
> on each run, so this is likely real.
>
> # fsck -n /var
> fsck from util-linux-ng 2.17.2
> e2fsck 1.41.12 (17-May-2010)
> Warning!  /dev/sda2 is mounted.
> Warning: skipping journal recovery because doing a read-only filesystem
> check.
> /dev/sda2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes Deleted inode 1059654 has zero
> dtime.  Fix? no
>
> Inodes that were part of a corrupted orphan linked list found.  Fix? no
>
> Inode 1061014 was part of the orphaned inode list.  IGNORED.
> Inode 1061275 was part of the orphaned inode list.  IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information Block bitmap differences:
> -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754
> -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507
> -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no
>
> Free blocks count wrong (6918236, counted=5214069).
> Fix? no
>
> Inode bitmap differences:  -1059654 -1061014 -1061275 Fix? no
>
> Free inodes count wrong (1966010, counted=1878618).
> Fix? no
>
>
> /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000
> blocks
>
>
> dmesg shows it had some scsi issues.  I suspect the scsi error
>
> is triggered by operation of VDP backup, which freezes the system
>
> for a second when completing the backup snapshot.
>
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0 sd 2:0:0:0: [sda]
> task abort on host 2, ffff880036e61ac0 sd 2:0:0:0: [sda] task abort on host
> 2, ffff880036e614c0 sd 2:0:0:0: [sda] task abort on host 2,
> ffff880036e61cc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0 sd 2:0:0:0: [sda]
> task abort on host 2, ffff880036e616c0 sd 2:0:0:0: [sda] task abort on host
> 2, ffff880036e615c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
> INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
>  ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
>  ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
>  ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 Call
> Trace:
>  [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750  [<ffffffff81014b39>] ?
> read_tsc+0x9/0x20  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
> [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>]
> io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50
> [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154bc78>]
> out_of_line_wait_on_bit+0x78/0x90  [<ffffffff810a69b0>] ?
> wake_bit_function+0x0/0x50  [<ffffffff810a67b7>] ? bit_waitqueue+0x17/0xd0
> [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30  [<ffffffffa0180146>]
> jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]  [<ffffffff8108fbdb>] ?
> try_to_del_timer_sync+0x7b/0xe0  [<ffffffffa0185a68>] kjournald2+0xb8/0x220
> [jbd2]  [<ffffffff810a6930>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa01859b0>] ? kjournald2+0x0/0x220 [jbd2]  [<ffffffff810a649e>]
> kthread+0x9e/0xc0  [<ffffffff8100c28a>] child_rip+0xa/0x20
> [<ffffffff810a6400>] ? kthread+0x0/0xc0  [<ffffffff8100c
>  280>] ? child_rip+0x0/0x20
> INFO: task master:1778 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> master        D 0000000000000000     0  1778      1 0x00000080
>  ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
>  00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
>  ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 Call
> Trace:
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>]
> io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50
> [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154ba78>]
> out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ?
> __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
> [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
> [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
> [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
> [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
> [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
> [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
> [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7102>]
> file_update_time+0xf2/0x170  [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
> [<ffffffff81199c2a>] do_sync_write+0xfa/0x140  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffff8119f964>] ?
> cp_new_stat+0xe4/0x100  [<ffffffff81014b39>] ? read_tsc+0x9/0x20
> [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100  [<ffffffff8123a
>  e06>] ? security_file_permission+0x16/0x20
>  [<ffffffff81199f28>] vfs_write+0xb8/0x1a0  [<ffffffff8119b416>] ?
> fget_light_pos+0x16/0x50  [<ffffffff8119aa61>] sys_write+0x51/0xb0
> [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> INFO: task pickup:1236 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> pickup        D 0000000000000001     0  1236   1778 0x00000080
>  ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
>  ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
>  ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 Call
> Trace:
>  [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
> [<ffffffff811d1440>] sync_buffer+0x40/0x50  [<ffffffff8154b99a>]
> __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
> [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ?
> __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
> [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
> [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
> [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
> [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
> [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
> [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
> [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7315>]
> touch_atime+0x195/0x1a0  [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
> [<ffffffff81199d6a>] do_sync_read+0xfa/0x140  [<ffffffff811e2e80>] ?
> ep_send_events_proc+0x0/0x110  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffff8123ae06>] ?
> security_file_permission+0x16/0x20
>  [<ffffffff8119a665>] vfs_read+0xb5/0x1a0  [<ffffffff8119b416>] ?
> fget_light_pos+0x16/0x50  [<ffffffff8119a9b1>] sys_read+0x51/0xb0
> [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b sd 2:0:0:0: [sda] task
> abort on host 2, ffff880036d7a680 sd 2:0:0:0: [sda] Failed to get
> completion for aborted cmd ffff880036d7a680 sd 2:0:0:0: [sda] SCSI device
> reset on scsi2:0
>
>
> If I just repair systems with that in their runtime history I should be on
> target for any concerns.
>
>
> Thanks for the responses...
>
>
>
>
> I've never really had fsck fail to correct errors when run manually. I
> have had the touch /forcefsck && reboot option decide that a fix was too
> risky and refuse to do it. The manual run would then fix it. Typically
> booting single user mode was enough to sort it out. If the problem disk was
> the root fs, then rescue media was the solution.
>
> We did have an iscsi array reboot which caused the filesystem to go
> read-only and at the time, we ran fsck -n to check for any errors. We did
> get a few errors of the type that you'd expect from a filesystem that is
> mounted, but not any inode or bitmap errors.
>
> We also had a hyper-v vm get in a wedged state because the backup
> mechanism called the filesystem freeze (fsfreeze) and then the backup
> software crashed and never unfroze the filesystem. We had to update the
> backup software and the hyper-v drivers for that.
>
> The only time I couldn't get fsck to behave was when a couple of systems
> had faulty RAM. In those cases the filesystem corruption was severe and it
> was easier to replace memory and reimage/restore from backups.
>
> So, I don't think fsck is showing false positives. You should be able to
> clear the errors with a manual fsck and I would definitely be concerned
> that a number of systems were showing fs errors.
>
> If you can't get the manual fsck to fix all of the errors, it might be
> worth opening a support ticket with RedHat.
>
> Hugh
>
>
>
> _______________________________________________
> rhelv6-list mailing list
> rhelv6-list at redhat.com
> https://www.redhat.com/mailman/listinfo/rhelv6-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171222/5d042dce/attachment.htm>

From fpicabia at gmail.com  Fri Dec 22 19:43:14 2017
From: fpicabia at gmail.com (francis picabia)
Date: Fri, 22 Dec 2017 15:43:14 -0400
Subject: [rhelv6-list] fsck -n always showing errors
In-Reply-To: <DM5PR04MB37712D52400FEF9A4EE5A71A9C020@DM5PR04MB3771.namprd04.prod.outlook.com>
References: <CA+AKB6GPzcV7MNetA3U3aew+KiuF92ebi4aKifHpVXBU3prhbg@mail.gmail.com>
	<CAG2kNCxBQ+Uz+H3o6i7h2f0Z9gKcq86E-UUn1_gP7d_XJq80zg@mail.gmail.com>
	<CA+AKB6FQFt6Lf-GSttzhmiDVTFh7i24bKB+ve-Nk+FGiweaTRQ@mail.gmail.com>
	<DD4C919E4D618E428F06AA127B43857001F6DBEB20@MAILSTORE3.campus.brocku.local>
	<CA+AKB6EcFObk9g0TTt0Aa3hmfdXRNgN-0N5+ZOJ9xPTtkeA+6Q@mail.gmail.com>
	<DM5PR04MB37712D52400FEF9A4EE5A71A9C020@DM5PR04MB3771.namprd04.prod.outlook.com>
Message-ID: <CA+AKB6FK+b8WF_ZNPoa50Jmwf4OVpzZ=Xe-Ey6CgF0=hMh_-fg@mail.gmail.com>

On Fri, Dec 22, 2017 at 12:29 PM, Brown, Hugh M <hugh-brown at uiowa.edu>
wrote:

> Response at bottom
>
> -----Original Message-----
> From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces@
> redhat.com] On Behalf Of francis picabia
> Sent: Thursday, December 21, 2017 9:47 AM
> To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list <
> rhelv6-list at redhat.com>
> Subject: Re: [rhelv6-list] fsck -n always showing errors
>
> Thanks for the replies...
>
>
> OK, I was expecting there must be some sort of false positive going on.
>
> For the system I listed here, those are not persistent errors.
>
>
> However there is one which does show the same orphaned inode numbers
>
> on each run, so this is likely real.
>
> # fsck -n /var
> fsck from util-linux-ng 2.17.2
> e2fsck 1.41.12 (17-May-2010)
> Warning!  /dev/sda2 is mounted.
> Warning: skipping journal recovery because doing a read-only filesystem
> check.
> /dev/sda2 contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes Deleted inode 1059654 has zero
> dtime.  Fix? no
>
> Inodes that were part of a corrupted orphan linked list found.  Fix? no
>
> Inode 1061014 was part of the orphaned inode list.  IGNORED.
> Inode 1061275 was part of the orphaned inode list.  IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information Block bitmap differences:
> -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754
> -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507
> -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no
>
> Free blocks count wrong (6918236, counted=5214069).
> Fix? no
>
> Inode bitmap differences:  -1059654 -1061014 -1061275 Fix? no
>
> Free inodes count wrong (1966010, counted=1878618).
> Fix? no
>
>
> /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000
> blocks
>
>
> dmesg shows it had some scsi issues.  I suspect the scsi error
>
> is triggered by operation of VDP backup, which freezes the system
>
> for a second when completing the backup snapshot.
>
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0 sd 2:0:0:0: [sda]
> task abort on host 2, ffff880036e61ac0 sd 2:0:0:0: [sda] task abort on host
> 2, ffff880036e614c0 sd 2:0:0:0: [sda] task abort on host 2,
> ffff880036e61cc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0 sd 2:0:0:0: [sda]
> task abort on host 2, ffff880036e616c0 sd 2:0:0:0: [sda] task abort on host
> 2, ffff880036e615c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
> INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8   D 0000000000000000     0   752      2 0x00000000
>  ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
>  ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
>  ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 Call
> Trace:
>  [<ffffffff813a27eb>] ? scsi_request_fn+0xdb/0x750  [<ffffffff81014b39>] ?
> read_tsc+0x9/0x20  [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100
> [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>]
> io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50
> [<ffffffff8154bbcf>] __wait_on_bit+0x5f/0x90  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154bc78>] out_of_line_wait_on_bit+0x78/0x90
> [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff810a67b7>] ?
> bit_waitqueue+0x17/0xd0  [<ffffffff811d13f6>] __wait_on_buffer+0x26/0x30
> [<ffffffffa0180146>] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
> [<ffffffff8108fbdb>] ? try_to_del_timer_sync+0x7b/0xe0
> [<ffffffffa0185a68>] kjournald2+0xb8/0x220 [jbd2]  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffffa01859b0>] ?
> kjournald2+0x0/0x220 [jbd2]  [<ffffffff810a649e>] kthread+0x9e/0xc0
> [<ffffffff8100c28a>] child_rip+0xa/0x20  [<ffffffff810a6400>] ?
> kthread+0x0/0xc0  [<ffffffff8100c
>  280>] ? child_rip+0x0/0x20
> INFO: task master:1778 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> master        D 0000000000000000     0  1778      1 0x00000080
>  ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460
>  00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001
>  ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 Call
> Trace:
>  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50  [<ffffffff8154b0e3>]
> io_schedule+0x73/0xc0  [<ffffffff811d1440>] sync_buffer+0x40/0x50
> [<ffffffff8154b99a>] __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+
> 0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ?
> __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
> [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
> [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
> [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
> [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
> [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
> [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
> [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7102>]
> file_update_time+0xf2/0x170  [<ffffffff811a4f02>] pipe_write+0x312/0x6b0
> [<ffffffff81199c2a>] do_sync_write+0xfa/0x140  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffff8119f964>] ?
> cp_new_stat+0xe4/0x100  [<ffffffff81014b39>] ? read_tsc+0x9/0x20
> [<ffffffff810b2a4f>] ? ktime_get_ts+0xbf/0x100  [<ffffffff8123a
>  e06>] ? security_file_permission+0x16/0x20
>  [<ffffffff81199f28>] vfs_write+0xb8/0x1a0  [<ffffffff8119b416>] ?
> fget_light_pos+0x16/0x50  [<ffffffff8119aa61>] sys_write+0x51/0xb0
> [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
> INFO: task pickup:1236 blocked for more than 120 seconds.
>       Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> pickup        D 0000000000000001     0  1236   1778 0x00000080
>  ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120
>  ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120
>  ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 Call
> Trace:
>  [<ffffffff811456e0>] ? __lru_cache_add+0x40/0x90  [<ffffffff811d1400>] ?
> sync_buffer+0x0/0x50  [<ffffffff8154b0e3>] io_schedule+0x73/0xc0
> [<ffffffff811d1440>] sync_buffer+0x40/0x50  [<ffffffff8154b99a>]
> __wait_on_bit_lock+0x5a/0xc0  [<ffffffff811d1400>] ? sync_buffer+0x0/0x50
> [<ffffffff8154ba78>] out_of_line_wait_on_bit_lock+0x78/0x90
>  [<ffffffff810a69b0>] ? wake_bit_function+0x0/0x50  [<ffffffff811d0999>] ?
> __find_get_block+0xa9/0x200  [<ffffffff811d15e6>] __lock_buffer+0x36/0x40
> [<ffffffffa017f2bb>] do_get_write_access+0x48b/0x520 [jbd2]
> [<ffffffffa017f4a1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
> [<ffffffffa01cd4a8>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
> [<ffffffffa01a6d63>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
> [<ffffffffa01a6ddc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
> [<ffffffffa017e3d5>] ? jbd2_journal_start+0xb5/0x100 [jbd2]
> [<ffffffffa01a70d0>] ext4_dirty_inode+0x40/0x60 [ext4]
> [<ffffffff811c69db>] __mark_inode_dirty+0x3b/0x1c0  [<ffffffff811b7315>]
> touch_atime+0x195/0x1a0  [<ffffffff811a5684>] pipe_read+0x3e4/0x4d0
> [<ffffffff81199d6a>] do_sync_read+0xfa/0x140  [<ffffffff811e2e80>] ?
> ep_send_events_proc+0x0/0x110  [<ffffffff810a6930>] ?
> autoremove_wake_function+0x0/0x40  [<ffffffff8123ae06>] ?
> security_file_permission+0x16/0x20
>  [<ffffffff8119a665>] vfs_read+0xb5/0x1a0  [<ffffffff8119b416>] ?
> fget_light_pos+0x16/0x50  [<ffffffff8119a9b1>] sys_read+0x51/0xb0
> [<ffffffff810ee4ce>] ? __audit_syscall_exit+0x25e/0x290
> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b sd 2:0:0:0: [sda] task
> abort on host 2, ffff880036d7a680 sd 2:0:0:0: [sda] Failed to get
> completion for aborted cmd ffff880036d7a680 sd 2:0:0:0: [sda] SCSI device
> reset on scsi2:0
>
>
> If I just repair systems with that in their runtime history I should be on
> target for any concerns.
>
>
> Thanks for the responses...
>
>
>
>
> I've never really had fsck fail to correct errors when run manually. I
> have had the touch /forcefsck && reboot option decide that a fix was too
> risky and refuse to do it. The manual run would then fix it. Typically
> booting single user mode was enough to sort it out. If the problem disk was
> the root fs, then rescue media was the solution.
>
> We did have an iscsi array reboot which caused the filesystem to go
> read-only and at the time, we ran fsck -n to check for any errors. We did
> get a few errors of the type that you'd expect from a filesystem that is
> mounted, but not any inode or bitmap errors.
>
> We also had a hyper-v vm get in a wedged state because the backup
> mechanism called the filesystem freeze (fsfreeze) and then the backup
> software crashed and never unfroze the filesystem. We had to update the
> backup software and the hyper-v drivers for that.
>
> The only time I couldn't get fsck to behave was when a couple of systems
> had faulty RAM. In those cases the filesystem corruption was severe and it
> was easier to replace memory and reimage/restore from backups.
>
> So, I don't think fsck is showing false positives. You should be able to
> clear the errors with a manual fsck and I would definitely be concerned
> that a number of systems were showing fs errors.
>
> If you can't get the manual fsck to fix all of the errors, it might be
> worth opening a support ticket with RedHat.
>
> Hugh
>
>
>
This topic has a degree of "You Mileage May Vary".  Yes, some file system
problems with real physical disk
errors will be difficult or sometimes even impossible to recover from.  It
depends how serious
the flaws are.  If it is only a power loss situation, then the transaction
rollback should do the trick.   If it
is media or a controller or other hardware causing interrupts, it is
anything from 4 to 9 on the
Richter scale - maybe just some loss in the log files, or database files
might be corrupted.

I say fsck is showing false positives because it is doing evaluations while
the file system is changing.  For minor items like block size counts or
missing
mod time, this would be typical of looking at the file system while there
are writes.

If you doubt it, try fsck -n /var on a system having an active website or
database.
See for yourself system what it reports.  I can tell you almost every
system's /var checked
with Redhat was not clean.  I'm talking about production servers and the
like,
not one's relatively quiet desktop system.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/rhelv6-list/attachments/20171222/a43db161/attachment.htm>