From fpicabia at gmail.com Wed Dec 20 20:27:42 2017 From: fpicabia at gmail.com (francis picabia) Date: Wed, 20 Dec 2017 16:27:42 -0400 Subject: [rhelv6-list] fsck -n always showing errors Message-ID: The file systems are typically ext4. Running current patches on Redhat 6.9. This isn't something we routinely look at, but after a couple of VMware systems showing scsi errors, I noticed almost every Redhat 6 system will show some disk errors from something like fsck -n / or same on /var # fsck -n / fsck from util-linux-ng 2.17.2 e2fsck 1.41.12 (17-May-2010) Warning! /dev/sda1 is mounted. Warning: skipping journal recovery because doing a read-only filesystem check. /dev/sda1 contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Deleted inode 22413413 has zero dtime. Fix? no Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: -46708735 -(46712832--46712840) +79201706 +79201715 -79201836 -79202321 -(79234801--79234855) +(79234914--79234968) Fix? no Free blocks count wrong (42767651, counted=42765254). Fix? no Inode bitmap differences: -22413413 Fix? no Free inodes count wrong (25523840, counted=25523326). Fix? no /dev/sda1: ********** WARNING: Filesystem still has errors ********** /dev/sda1: 526720/26050560 files (0.2% non-contiguous), 61424349/104192000 blocks Neither the SAN backend for these VMs nor the host boxes have any alerts or warnings. With one development box I did touch /forcefsck and rebooted. Retested fsck and still issues. Repeated this cycle 3 times and no improvement. We have seen errors in the file system flagged like this on physical systems and VMs. They are not impacting performance or booting, but given that two showed scsi errors, I would think there is some possibility of data corruption. This is too widespread to be any particular hardware system at fault. I can't tell whether there are bugs in ext4 or whether fsck is giving false positives. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gianluca.cecchi at gmail.com Wed Dec 20 21:57:26 2017 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 20 Dec 2017 22:57:26 +0100 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: On Wed, Dec 20, 2017 at 9:27 PM, francis picabia wrote: > > With one development box I did touch /forcefsck and rebooted. > Retested fsck and still issues. Repeated this cycle 3 times > and no improvement. > Hi, not going into the reasons of the problem, but into your "cycle". if I have understood correctly your sentence, you run fsck and use "-n" option that automatically answers "no" to all the questions related to problems and suggestions to fix them. So, as you didn't fix anything, the next run the fsck command exposes the same problems again.... Sometimes I have seen in vSphere environments storage problems causing linux VMs problems and so kernel to automatically put one or more filesystems in read-only mode: typically the filesystems where there were writes in action during the problem occurrence.
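For example, a quick way to spot that condition on a running box, even without monitoring in place, is something like this against /proc/mounts, which should list any filesystem currently mounted read-only:

# awk '$4 ~ /(^|,)ro(,|$)/ {print $1, $2}' /proc/mounts

and the kernel's "Remounting filesystem read-only" message should also be visible in /var/log/messages from around the time of the storage problem.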
So in your case it could be something similar with impact to all the VMs insisting on the problematic storage / datastore If you have no monitoring in place, such as Nagios and a monitor like this: https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details you can go ahead also some days before realizing that you had a problem Analyzing /var/log/messages you should see when it happened Take in mind that if the filesystem went in read-only mode due to a SCSI error (action taken by the kernel to prevent further errors and data corruption), you will not be able to remount it read-write, but you have to reboot the server. Just a guess. HIH, Gianluca -------------- next part -------------- An HTML attachment was scrubbed... URL: From solarflow99 at gmail.com Wed Dec 20 22:08:31 2017 From: solarflow99 at gmail.com (solarflow99) Date: Wed, 20 Dec 2017 14:08:31 -0800 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: thanks for the replies, nice to see some of us are still using the list. I sure wish they just gated email <-> forum instead of eliminating it. On Wed, Dec 20, 2017 at 1:57 PM, Gianluca Cecchi wrote: > > On Wed, Dec 20, 2017 at 9:27 PM, francis picabia > wrote: > >> >> With one development box I did touch /forcefsck and rebooted. >> Retested fsck and still issues. Repeated this cycle 3 times >> and no improvement. >> > > Hi, > not going into the reasons of the problem, but into your "cycle". > if I have understood correctly your sentence, you run fsck and use "-n" > option that automatically answers "no" to all the questions related to > problems and suggestions to fix them. > So, as you didn't fix anything, the next run the fsck command exposes the > same problems again.... > > Sometimes I have seen in vSphere environments storage problems causing > linux VMs problems and so kernel to automatically put one or more > filesystems in read-only mode: typically the filesystems where there were > writes in action during the problem occurrence. > So in your case it could be something similar with impact to all the VMs > insisting on the problematic storage / datastore > If you have no monitoring in place, such as Nagios and a monitor like this: > https://exchange.nagios.org/directory/Plugins/Operating- > Systems/Linux/check_ro_mounts/details > you can go ahead also some days before realizing that you had a problem > Analyzing /var/log/messages you should see when it happened > > Take in mind that if the filesystem went in read-only mode due to a SCSI > error (action taken by the kernel to prevent further errors and data > corruption), you will not be able to remount it read-write, but you have to > reboot the server. > > Just a guess. > HIH, > Gianluca > > > _______________________________________________ > rhelv6-list mailing list > rhelv6-list at redhat.com > https://www.redhat.com/mailman/listinfo/rhelv6-list > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fpicabia at gmail.com Thu Dec 21 12:20:56 2017 From: fpicabia at gmail.com (francis picabia) Date: Thu, 21 Dec 2017 08:20:56 -0400 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: fsck -n is used to verify only. The touch on /forcefsck will force a regular fsck on unmounted partitions on boot up. So what I've done is: fsck -n touch /forcefsck reboot times three. It should be actually fixing the problems on reboot. 
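To confirm the boot-time check actually ran and left the filesystem marked clean, something like this against the superblock (using /dev/sda1 from the example above) should show it:

# tune2fs -l /dev/sda1 | grep -E 'Filesystem state|Last checked|Mount count'

If "Last checked" still predates the reboot, or the filesystem state is not "clean", then the forced fsck never got as far as repairing anything.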
I can find there are at least some fsck errors on every Redhat 6 machine, whether virtual or physical. I mean I've tested the fsck -n status on about twelve systems which have some errors. Only 2 showed a history of SCSI errors, both happening to be VMware. Maybe some other people can test this on their Redhat 6 systems and see if fsck -n /var or similar comes back clean. You might be surprised to see the same state I've noticed. There is no issue like read-only file system. Everything is functional. On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi wrote: > > On Wed, Dec 20, 2017 at 9:27 PM, francis picabia > wrote: > >> >> With one development box I did touch /forcefsck and rebooted. >> Retested fsck and still issues. Repeated this cycle 3 times >> and no improvement. >> > > Hi, > not going into the reasons of the problem, but into your "cycle". > if I have understood correctly your sentence, you run fsck and use "-n" > option that automatically answers "no" to all the questions related to > problems and suggestions to fix them. > So, as you didn't fix anything, the next run the fsck command exposes the > same problems again.... > > Sometimes I have seen in vSphere environments storage problems causing > linux VMs problems and so kernel to automatically put one or more > filesystems in read-only mode: typically the filesystems where there were > writes in action during the problem occurrence. > So in your case it could be something similar with impact to all the VMs > insisting on the problematic storage / datastore > If you have no monitoring in place, such as Nagios and a monitor like this: > https://exchange.nagios.org/directory/Plugins/Operating- > Systems/Linux/check_ro_mounts/details > you can go ahead also some days before realizing that you had a problem > Analyzing /var/log/messages you should see when it happened > > Take in mind that if the filesystem went in read-only mode due to a SCSI > error (action taken by the kernel to prevent further errors and data > corruption), you will not be able to remount it read-write, but you have to > reboot the server. > > Just a guess. > HIH, > Gianluca > > > _______________________________________________ > rhelv6-list mailing list > rhelv6-list at redhat.com > https://www.redhat.com/mailman/listinfo/rhelv6-list > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cfairchild at brocku.ca Thu Dec 21 13:09:03 2017 From: cfairchild at brocku.ca (Cale Fairchild) Date: Thu, 21 Dec 2017 13:09:03 +0000 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: Have you checked the filesystem from a rescue disk or does the fsck on reboot report that it is fixing errors each time? As far as I understand running `fsck -n /` on the active root filesystem will most always return some errors as the blocks in the filesystem are changing while the fsck is running it?s passes. Thus the warning at the beginning of the process about the filesystem being mounted. Sorry if I am misunderstanding your process, but if you have not tried checking the filesystem after booting into rescue mode that would be a good step. From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia Sent: December 21, 2017 07:21 To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list Subject: Re: [rhelv6-list] fsck -n always showing errors fsck -n is used to verify only. The touch on /forcefsck will force a regular fsck on unmounted partitions on boot up. 
So what I've done is: fsck -n touch /forcefsck reboot times three. It should be actually fixing the problems on reboot. I can find there are at least some fsck errors on every Redhat 6 machine, whether virtual or physical. I mean I've tested the fsck -n status on about twelve systems which have some errors. Only 2 showed a history of SCSI errors, both happening to be VMware. Maybe some other people can test this on their Redhat 6 systems and see if fsck -n /var or similar comes back clean. You might be surprised to see the same state I've noticed. There is no issue like read-only file system. Everything is functional. On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi > wrote: On Wed, Dec 20, 2017 at 9:27 PM, francis picabia > wrote: With one development box I did touch /forcefsck and rebooted. Retested fsck and still issues. Repeated this cycle 3 times and no improvement. Hi, not going into the reasons of the problem, but into your "cycle". if I have understood correctly your sentence, you run fsck and use "-n" option that automatically answers "no" to all the questions related to problems and suggestions to fix them. So, as you didn't fix anything, the next run the fsck command exposes the same problems again.... Sometimes I have seen in vSphere environments storage problems causing linux VMs problems and so kernel to automatically put one or more filesystems in read-only mode: typically the filesystems where there were writes in action during the problem occurrence. So in your case it could be something similar with impact to all the VMs insisting on the problematic storage / datastore If you have no monitoring in place, such as Nagios and a monitor like this: https://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_ro_mounts/details you can go ahead also some days before realizing that you had a problem Analyzing /var/log/messages you should see when it happened Take in mind that if the filesystem went in read-only mode due to a SCSI error (action taken by the kernel to prevent further errors and data corruption), you will not be able to remount it read-write, but you have to reboot the server. Just a guess. HIH, Gianluca _______________________________________________ rhelv6-list mailing list rhelv6-list at redhat.com https://www.redhat.com/mailman/listinfo/rhelv6-list -------------- next part -------------- An HTML attachment was scrubbed... URL: From fpicabia at gmail.com Thu Dec 21 15:46:59 2017 From: fpicabia at gmail.com (francis picabia) Date: Thu, 21 Dec 2017 11:46:59 -0400 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: Thanks for the replies... OK, I was expecting there must be some sort of false positive going on. For the system I listed here, those are not persistent errors. However there is one which does show the same orphaned inode numbers on each run, so this is likely real. # fsck -n /var fsck from util-linux-ng 2.17.2 e2fsck 1.41.12 (17-May-2010) Warning! /dev/sda2 is mounted. Warning: skipping journal recovery because doing a read-only filesystem check. /dev/sda2 contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Deleted inode 1059654 has zero dtime. Fix? no Inodes that were part of a corrupted orphan linked list found. Fix? no Inode 1061014 was part of the orphaned inode list. IGNORED. Inode 1061275 was part of the orphaned inode list. IGNORED. 
Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no Free blocks count wrong (6918236, counted=5214069). Fix? no Inode bitmap differences: -1059654 -1061014 -1061275 Fix? no Free inodes count wrong (1966010, counted=1878618). Fix? no /dev/sda2: ********** WARNING: Filesystem still has errors ********** /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 blocks dmesg shows it had some scsi issues. I suspect the scsi error is triggered by operation of VDP backup, which freezes the system for a second when completing the backup snapshot. sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0 INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds. Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. jbd2/sda2-8 D 0000000000000000 0 752 2 0x00000000 ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 Call Trace: [] ? scsi_request_fn+0xdb/0x750 [] ? read_tsc+0x9/0x20 [] ? ktime_get_ts+0xbf/0x100 [] ? sync_buffer+0x0/0x50 [] io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 [] __wait_on_bit+0x5f/0x90 [] ? sync_buffer+0x0/0x50 [] out_of_line_wait_on_bit+0x78/0x90 [] ? wake_bit_function+0x0/0x50 [] ? bit_waitqueue+0x17/0xd0 [] __wait_on_buffer+0x26/0x30 [] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2] [] ? try_to_del_timer_sync+0x7b/0xe0 [] kjournald2+0xb8/0x220 [jbd2] [] ? autoremove_wake_function+0x0/0x40 [] ? kjournald2+0x0/0x220 [jbd2] [] kthread+0x9e/0xc0 [] child_rip+0xa/0x20 [] ? kthread+0x0/0xc0 [] ? child_rip+0x0/0x20 INFO: task master:1778 blocked for more than 120 seconds. Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. master D 0000000000000000 0 1778 1 0x00000080 ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001 ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 Call Trace: [] ? sync_buffer+0x0/0x50 [] io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 [] __wait_on_bit_lock+0x5a/0xc0 [] ? sync_buffer+0x0/0x50 [] out_of_line_wait_on_bit_lock+0x78/0x90 [] ? wake_bit_function+0x0/0x50 [] ? __find_get_block+0xa9/0x200 [] __lock_buffer+0x36/0x40 [] do_get_write_access+0x48b/0x520 [jbd2] [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [] __ext4_journal_get_write_access+0x38/0x80 [ext4] [] ext4_reserve_inode_write+0x73/0xa0 [ext4] [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] [] ? 
jbd2_journal_start+0xb5/0x100 [jbd2] [] ext4_dirty_inode+0x40/0x60 [ext4] [] __mark_inode_dirty+0x3b/0x1c0 [] file_update_time+0xf2/0x170 [] pipe_write+0x312/0x6b0 [] do_sync_write+0xfa/0x140 [] ? autoremove_wake_function+0x0/0x40 [] ? cp_new_stat+0xe4/0x100 [] ? read_tsc+0x9/0x20 [] ? ktime_get_ts+0xbf/0x100 [] ? security_file_permission+0x16/0x20 [] vfs_write+0xb8/0x1a0 [] ? fget_light_pos+0x16/0x50 [] sys_write+0x51/0xb0 [] ? __audit_syscall_exit+0x25e/0x290 [] system_call_fastpath+0x16/0x1b INFO: task pickup:1236 blocked for more than 120 seconds. Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. pickup D 0000000000000001 0 1236 1778 0x00000080 ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120 ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120 ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 Call Trace: [] ? __lru_cache_add+0x40/0x90 [] ? sync_buffer+0x0/0x50 [] io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 [] __wait_on_bit_lock+0x5a/0xc0 [] ? sync_buffer+0x0/0x50 [] out_of_line_wait_on_bit_lock+0x78/0x90 [] ? wake_bit_function+0x0/0x50 [] ? __find_get_block+0xa9/0x200 [] __lock_buffer+0x36/0x40 [] do_get_write_access+0x48b/0x520 [jbd2] [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [] __ext4_journal_get_write_access+0x38/0x80 [ext4] [] ext4_reserve_inode_write+0x73/0xa0 [ext4] [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] [] ? jbd2_journal_start+0xb5/0x100 [jbd2] [] ext4_dirty_inode+0x40/0x60 [ext4] [] __mark_inode_dirty+0x3b/0x1c0 [] touch_atime+0x195/0x1a0 [] pipe_read+0x3e4/0x4d0 [] do_sync_read+0xfa/0x140 [] ? ep_send_events_proc+0x0/0x110 [] ? autoremove_wake_function+0x0/0x40 [] ? security_file_permission+0x16/0x20 [] vfs_read+0xb5/0x1a0 [] ? fget_light_pos+0x16/0x50 [] sys_read+0x51/0xb0 [] ? __audit_syscall_exit+0x25e/0x290 [] system_call_fastpath+0x16/0x1b sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680 sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680 sd 2:0:0:0: [sda] SCSI device reset on scsi2:0 If I just repair systems with that in their runtime history I should be on target for any concerns. Thanks for the responses... On Thu, Dec 21, 2017 at 9:09 AM, Cale Fairchild wrote: > Have you checked the filesystem from a rescue disk or does the fsck on > reboot report that it is fixing errors each time? As far as I understand > running `fsck -n /` on the active root filesystem will most always return > some errors as the blocks in the filesystem are changing while the fsck is > running it?s passes. Thus the warning at the beginning of the process about > the filesystem being mounted. Sorry if I am misunderstanding your process, > but if you have not tried checking the filesystem after booting into rescue > mode that would be a good step. > > > > *From:* rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces@ > redhat.com] *On Behalf Of *francis picabia > *Sent:* December 21, 2017 07:21 > *To:* Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list < > rhelv6-list at redhat.com> > *Subject:* Re: [rhelv6-list] fsck -n always showing errors > > > > fsck -n is used to verify only. > > The touch on /forcefsck will force a regular fsck on unmounted > > partitions on boot up. > > So what I've done is: > > fsck -n > > touch /forcefsck > > reboot > > times three. > > It should be actually fixing the problems on reboot. 
> > I can find there are at least some fsck errors on every Redhat 6 machine, > > whether virtual or physical. I mean I've tested the fsck -n status on > about > > twelve systems which have some errors. Only 2 showed a history > > of SCSI errors, both happening to be VMware. > > Maybe some other people can test this on their Redhat 6 systems > > and see if fsck -n /var or similar comes back clean. You might > > be surprised to see the same state I've noticed. There is > > no issue like read-only file system. Everything is functional. > > > > > > On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi < > gianluca.cecchi at gmail.com> wrote: > > > > On Wed, Dec 20, 2017 at 9:27 PM, francis picabia > wrote: > > > > With one development box I did touch /forcefsck and rebooted. > > Retested fsck and still issues. Repeated this cycle 3 times > > and no improvement. > > > > Hi, > > not going into the reasons of the problem, but into your "cycle". > > if I have understood correctly your sentence, you run fsck and use "-n" > option that automatically answers "no" to all the questions related to > problems and suggestions to fix them. > > So, as you didn't fix anything, the next run the fsck command exposes the > same problems again.... > > > > Sometimes I have seen in vSphere environments storage problems causing > linux VMs problems and so kernel to automatically put one or more > filesystems in read-only mode: typically the filesystems where there were > writes in action during the problem occurrence. > > So in your case it could be something similar with impact to all the VMs > insisting on the problematic storage / datastore > > If you have no monitoring in place, such as Nagios and a monitor like this: > > https://exchange.nagios.org/directory/Plugins/Operating- > Systems/Linux/check_ro_mounts/details > > you can go ahead also some days before realizing that you had a problem > > Analyzing /var/log/messages you should see when it happened > > > > Take in mind that if the filesystem went in read-only mode due to a SCSI > error (action taken by the kernel to prevent further errors and data > corruption), you will not be able to remount it read-write, but you have to > reboot the server. > > > > Just a guess. > > HIH, > > Gianluca > > > > > _______________________________________________ > rhelv6-list mailing list > rhelv6-list at redhat.com > https://www.redhat.com/mailman/listinfo/rhelv6-list > > > > _______________________________________________ > rhelv6-list mailing list > rhelv6-list at redhat.com > https://www.redhat.com/mailman/listinfo/rhelv6-list > -------------- next part -------------- An HTML attachment was scrubbed... URL: From solarflow99 at gmail.com Thu Dec 21 17:06:56 2017 From: solarflow99 at gmail.com (solarflow99) Date: Thu, 21 Dec 2017 09:06:56 -0800 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: I'd just do a rescue, doesn't even need to be EL-6, and do the fsck in rw mode On Dec 21, 2017 7:47 AM, "francis picabia" wrote: > Thanks for the replies... > > OK, I was expecting there must be some sort of false positive going on. > For the system I listed here, those are not persistent errors. > > However there is one which does show the same orphaned inode numbers > on each run, so this is likely real. > > # fsck -n /var > fsck from util-linux-ng 2.17.2 > e2fsck 1.41.12 (17-May-2010) > Warning! /dev/sda2 is mounted. > Warning: skipping journal recovery because doing a read-only filesystem > check. 
> /dev/sda2 contains a file system with errors, check forced. > Pass 1: Checking inodes, blocks, and sizes > Deleted inode 1059654 has zero dtime. Fix? no > > Inodes that were part of a corrupted orphan linked list found. Fix? no > > Inode 1061014 was part of the orphaned inode list. IGNORED. > Inode 1061275 was part of the orphaned inode list. IGNORED. > Pass 2: Checking directory structure > Pass 3: Checking directory connectivity > Pass 4: Checking reference counts > Pass 5: Checking group summary information > Block bitmap differences: -124293 -130887 -4244999 -4285460 -4979711 > -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657 > -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336 > Fix? no > > Free blocks count wrong (6918236, counted=5214069). > Fix? no > > Inode bitmap differences: -1059654 -1061014 -1061275 > Fix? no > > Free inodes count wrong (1966010, counted=1878618). > Fix? no > > > /dev/sda2: ********** WARNING: Filesystem still has errors ********** > > /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 > blocks > > dmesg shows it had some scsi issues. I suspect the scsi error > is triggered by operation of VDP backup, which freezes the system > for a second when completing the backup snapshot. > > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0 > INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds. > Not tainted 2.6.32-696.3.2.el6.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > jbd2/sda2-8 D 0000000000000000 0 752 2 0x00000000 > ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb > ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f > ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 > Call Trace: > [] ? scsi_request_fn+0xdb/0x750 > [] ? read_tsc+0x9/0x20 > [] ? ktime_get_ts+0xbf/0x100 > [] ? sync_buffer+0x0/0x50 > [] io_schedule+0x73/0xc0 > [] sync_buffer+0x40/0x50 > [] __wait_on_bit+0x5f/0x90 > [] ? sync_buffer+0x0/0x50 > [] out_of_line_wait_on_bit+0x78/0x90 > [] ? wake_bit_function+0x0/0x50 > [] ? bit_waitqueue+0x17/0xd0 > [] __wait_on_buffer+0x26/0x30 > [] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2] > [] ? try_to_del_timer_sync+0x7b/0xe0 > [] kjournald2+0xb8/0x220 [jbd2] > [] ? autoremove_wake_function+0x0/0x40 > [] ? kjournald2+0x0/0x220 [jbd2] > [] kthread+0x9e/0xc0 > [] child_rip+0xa/0x20 > [] ? kthread+0x0/0xc0 > [] ? child_rip+0x0/0x20 > INFO: task master:1778 blocked for more than 120 seconds. > Not tainted 2.6.32-696.3.2.el6.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > master D 0000000000000000 0 1778 1 0x00000080 > ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460 > 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001 > ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 > Call Trace: > [] ? sync_buffer+0x0/0x50 > [] io_schedule+0x73/0xc0 > [] sync_buffer+0x40/0x50 > [] __wait_on_bit_lock+0x5a/0xc0 > [] ? 
sync_buffer+0x0/0x50 > [] out_of_line_wait_on_bit_lock+0x78/0x90 > [] ? wake_bit_function+0x0/0x50 > [] ? __find_get_block+0xa9/0x200 > [] __lock_buffer+0x36/0x40 > [] do_get_write_access+0x48b/0x520 [jbd2] > [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] > [] __ext4_journal_get_write_access+0x38/0x80 [ext4] > [] ext4_reserve_inode_write+0x73/0xa0 [ext4] > [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] > [] ? jbd2_journal_start+0xb5/0x100 [jbd2] > [] ext4_dirty_inode+0x40/0x60 [ext4] > [] __mark_inode_dirty+0x3b/0x1c0 > [] file_update_time+0xf2/0x170 > [] pipe_write+0x312/0x6b0 > [] do_sync_write+0xfa/0x140 > [] ? autoremove_wake_function+0x0/0x40 > [] ? cp_new_stat+0xe4/0x100 > [] ? read_tsc+0x9/0x20 > [] ? ktime_get_ts+0xbf/0x100 > [] ? security_file_permission+0x16/0x20 > [] vfs_write+0xb8/0x1a0 > [] ? fget_light_pos+0x16/0x50 > [] sys_write+0x51/0xb0 > [] ? __audit_syscall_exit+0x25e/0x290 > [] system_call_fastpath+0x16/0x1b > INFO: task pickup:1236 blocked for more than 120 seconds. > Not tainted 2.6.32-696.3.2.el6.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > pickup D 0000000000000001 0 1236 1778 0x00000080 > ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120 > ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120 > ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 > Call Trace: > [] ? __lru_cache_add+0x40/0x90 > [] ? sync_buffer+0x0/0x50 > [] io_schedule+0x73/0xc0 > [] sync_buffer+0x40/0x50 > [] __wait_on_bit_lock+0x5a/0xc0 > [] ? sync_buffer+0x0/0x50 > [] out_of_line_wait_on_bit_lock+0x78/0x90 > [] ? wake_bit_function+0x0/0x50 > [] ? __find_get_block+0xa9/0x200 > [] __lock_buffer+0x36/0x40 > [] do_get_write_access+0x48b/0x520 [jbd2] > [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] > [] __ext4_journal_get_write_access+0x38/0x80 [ext4] > [] ext4_reserve_inode_write+0x73/0xa0 [ext4] > [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] > [] ? jbd2_journal_start+0xb5/0x100 [jbd2] > [] ext4_dirty_inode+0x40/0x60 [ext4] > [] __mark_inode_dirty+0x3b/0x1c0 > [] touch_atime+0x195/0x1a0 > [] pipe_read+0x3e4/0x4d0 > [] do_sync_read+0xfa/0x140 > [] ? ep_send_events_proc+0x0/0x110 > [] ? autoremove_wake_function+0x0/0x40 > [] ? security_file_permission+0x16/0x20 > [] vfs_read+0xb5/0x1a0 > [] ? fget_light_pos+0x16/0x50 > [] sys_read+0x51/0xb0 > [] ? __audit_syscall_exit+0x25e/0x290 > [] system_call_fastpath+0x16/0x1b > sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680 > sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680 > sd 2:0:0:0: [sda] SCSI device reset on scsi2:0 > > If I just repair systems with that in their runtime history I should be on > target > for any concerns. > > Thanks for the responses... > > > On Thu, Dec 21, 2017 at 9:09 AM, Cale Fairchild > wrote: > >> Have you checked the filesystem from a rescue disk or does the fsck on >> reboot report that it is fixing errors each time? As far as I understand >> running `fsck -n /` on the active root filesystem will most always return >> some errors as the blocks in the filesystem are changing while the fsck is >> running it?s passes. Thus the warning at the beginning of the process about >> the filesystem being mounted. Sorry if I am misunderstanding your process, >> but if you have not tried checking the filesystem after booting into rescue >> mode that would be a good step. 
>> >> >> >> *From:* rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at re >> dhat.com] *On Behalf Of *francis picabia >> *Sent:* December 21, 2017 07:21 >> *To:* Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list < >> rhelv6-list at redhat.com> >> *Subject:* Re: [rhelv6-list] fsck -n always showing errors >> >> >> >> fsck -n is used to verify only. >> >> The touch on /forcefsck will force a regular fsck on unmounted >> >> partitions on boot up. >> >> So what I've done is: >> >> fsck -n >> >> touch /forcefsck >> >> reboot >> >> times three. >> >> It should be actually fixing the problems on reboot. >> >> I can find there are at least some fsck errors on every Redhat 6 machine, >> >> whether virtual or physical. I mean I've tested the fsck -n status on >> about >> >> twelve systems which have some errors. Only 2 showed a history >> >> of SCSI errors, both happening to be VMware. >> >> Maybe some other people can test this on their Redhat 6 systems >> >> and see if fsck -n /var or similar comes back clean. You might >> >> be surprised to see the same state I've noticed. There is >> >> no issue like read-only file system. Everything is functional. >> >> >> >> >> >> On Wed, Dec 20, 2017 at 5:57 PM, Gianluca Cecchi < >> gianluca.cecchi at gmail.com> wrote: >> >> >> >> On Wed, Dec 20, 2017 at 9:27 PM, francis picabia >> wrote: >> >> >> >> With one development box I did touch /forcefsck and rebooted. >> >> Retested fsck and still issues. Repeated this cycle 3 times >> >> and no improvement. >> >> >> >> Hi, >> >> not going into the reasons of the problem, but into your "cycle". >> >> if I have understood correctly your sentence, you run fsck and use "-n" >> option that automatically answers "no" to all the questions related to >> problems and suggestions to fix them. >> >> So, as you didn't fix anything, the next run the fsck command exposes the >> same problems again.... >> >> >> >> Sometimes I have seen in vSphere environments storage problems causing >> linux VMs problems and so kernel to automatically put one or more >> filesystems in read-only mode: typically the filesystems where there were >> writes in action during the problem occurrence. >> >> So in your case it could be something similar with impact to all the VMs >> insisting on the problematic storage / datastore >> >> If you have no monitoring in place, such as Nagios and a monitor like >> this: >> >> https://exchange.nagios.org/directory/Plugins/Operating-Syst >> ems/Linux/check_ro_mounts/details >> >> you can go ahead also some days before realizing that you had a problem >> >> Analyzing /var/log/messages you should see when it happened >> >> >> >> Take in mind that if the filesystem went in read-only mode due to a SCSI >> error (action taken by the kernel to prevent further errors and data >> corruption), you will not be able to remount it read-write, but you have to >> reboot the server. >> >> >> >> Just a guess. 
>> >> HIH, >> >> Gianluca >> >> >> >> >> _______________________________________________ >> rhelv6-list mailing list >> rhelv6-list at redhat.com >> https://www.redhat.com/mailman/listinfo/rhelv6-list >> >> >> >> _______________________________________________ >> rhelv6-list mailing list >> rhelv6-list at redhat.com >> https://www.redhat.com/mailman/listinfo/rhelv6-list >> > > > _______________________________________________ > rhelv6-list mailing list > rhelv6-list at redhat.com > https://www.redhat.com/mailman/listinfo/rhelv6-list > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tim.Mooney at ndsu.edu Thu Dec 21 18:48:23 2017 From: Tim.Mooney at ndsu.edu (Tim Mooney) Date: Thu, 21 Dec 2017 12:48:23 -0600 (CST) Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: In regard to: rhelv6-list Digest, Vol 76, Issue 1, rhelv6-list-request at redh...: > This isn't something we routinely look at, but after > a couple of VMware systems showing scsi errors, I noticed almost > every Redhat 6 system will show some disk errors from > something like fsck -n / or same on /var > > # fsck -n / > fsck from util-linux-ng 2.17.2 > e2fsck 1.41.12 (17-May-2010) > Warning! /dev/sda1 is mounted. There's your problem. Don't run fsck on a mounted filesystem. Even with -n, it just shows you false positives. Do some web searching for fsck on a mounted filesystem to understand why. Tim -- Tim Mooney Tim.Mooney at ndsu.edu Enterprise Computing & Infrastructure 701-231-1076 (Voice) Room 242-J6, Quentin Burdick Building 701-231-8541 (Fax) North Dakota State University, Fargo, ND 58105-5164 From fpicabia at gmail.com Fri Dec 22 15:05:04 2017 From: fpicabia at gmail.com (francis picabia) Date: Fri, 22 Dec 2017 11:05:04 -0400 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: On Thu, Dec 21, 2017 at 2:48 PM, Tim Mooney wrote: > In regard to: rhelv6-list Digest, Vol 76, Issue 1, > rhelv6-list-request at redh...: > > This isn't something we routinely look at, but after >> a couple of VMware systems showing scsi errors, I noticed almost >> every Redhat 6 system will show some disk errors from >> something like fsck -n / or same on /var >> >> # fsck -n / >> fsck from util-linux-ng 2.17.2 >> e2fsck 1.41.12 (17-May-2010) >> Warning! /dev/sda1 is mounted. >> > > There's your problem. Don't run fsck on a mounted filesystem. Even > with -n, it just shows you false positives. > > Do some web searching for > > fsck on a mounted filesystem > > to understand why. > > Well, I think they make the -n/-N flag in fsck for some purpose other than don't do it. It is designed to be run on a system to check it without modifying. My conclusion is it is only useful for seeing an error such as orphaned inodes which are persistent across multiple runs of fsck -n If there are other checksums and such that don't seem correct, that would be expected on a live filesystem. -------------- next part -------------- An HTML attachment was scrubbed... URL: From KCollins at chevron.com Fri Dec 22 15:42:26 2017 From: KCollins at chevron.com (Collins, Kevin) Date: Fri, 22 Dec 2017 15:42:26 +0000 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: <6F56410FBED1FC41BCA804E16F594B0B78C94B76@san520w8xmbx05.gdc0.chevron.net> From ?man fsck?: -N Don't execute, just show what would be done. and: Options to different filesystem-specific fsck's are not standardized. 
If in doubt, please consult the man pages of the filesystem-specific checker. Although not guaranteed, the following options are supported by most file system checkers: ? ? -n For some filesystem-specific checkers, the -n option will cause the fs-specific fsck to avoid attempting to repair any problems, but simply report such problems to stdout. This is however not true for all filesystem-specific checkers. In particular, fsck.reiserfs(8) will not report any corruption if given this option. fsck.minix(8) does not support the -n option at all. From ?man e2fsck?: -n Open the filesystem read-only, and assume an answer of 'no' to all questions. Allows e2fsck to be used non-interactively. This option may not be specified at the same time as the -p or -y options. Kevin From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia Sent: Friday, December 22, 2017 7:05 AM To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list Subject: [**EXTERNAL**] Re: [rhelv6-list] fsck -n always showing errors On Thu, Dec 21, 2017 at 2:48 PM, Tim Mooney > wrote: In regard to: rhelv6-list Digest, Vol 76, Issue 1, rhelv6-list-request at redh...: This isn't something we routinely look at, but after a couple of VMware systems showing scsi errors, I noticed almost every Redhat 6 system will show some disk errors from something like fsck -n / or same on /var # fsck -n / fsck from util-linux-ng 2.17.2 e2fsck 1.41.12 (17-May-2010) Warning! /dev/sda1 is mounted. There's your problem. Don't run fsck on a mounted filesystem. Even with -n, it just shows you false positives. Do some web searching for fsck on a mounted filesystem to understand why. Well, I think they make the -n/-N flag in fsck for some purpose other than don't do it. It is designed to be run on a system to check it without modifying. My conclusion is it is only useful for seeing an error such as orphaned inodes which are persistent across multiple runs of fsck -n If there are other checksums and such that don't seem correct, that would be expected on a live filesystem. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hugh-brown at uiowa.edu Fri Dec 22 16:29:27 2017 From: hugh-brown at uiowa.edu (Brown, Hugh M) Date: Fri, 22 Dec 2017 16:29:27 +0000 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: Response at bottom -----Original Message----- From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia Sent: Thursday, December 21, 2017 9:47 AM To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list Subject: Re: [rhelv6-list] fsck -n always showing errors Thanks for the replies... OK, I was expecting there must be some sort of false positive going on. For the system I listed here, those are not persistent errors. However there is one which does show the same orphaned inode numbers on each run, so this is likely real. # fsck -n /var fsck from util-linux-ng 2.17.2 e2fsck 1.41.12 (17-May-2010) Warning! /dev/sda2 is mounted. Warning: skipping journal recovery because doing a read-only filesystem check. /dev/sda2 contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Deleted inode 1059654 has zero dtime. Fix? no Inodes that were part of a corrupted orphan linked list found. Fix? no Inode 1061014 was part of the orphaned inode list. IGNORED. Inode 1061275 was part of the orphaned inode list. IGNORED. 
Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Block bitmap differences: -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no Free blocks count wrong (6918236, counted=5214069). Fix? no Inode bitmap differences: -1059654 -1061014 -1061275 Fix? no Free inodes count wrong (1966010, counted=1878618). Fix? no /dev/sda2: ********** WARNING: Filesystem still has errors ********** /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 blocks dmesg shows it had some scsi issues. I suspect the scsi error is triggered by operation of VDP backup, which freezes the system for a second when completing the backup snapshot. sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0 INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds. Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. jbd2/sda2-8 D 0000000000000000 0 752 2 0x00000000 ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 Call Trace: [] ? scsi_request_fn+0xdb/0x750 [] ? read_tsc+0x9/0x20 [] ? ktime_get_ts+0xbf/0x100 [] ? sync_buffer+0x0/0x50 [] io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 [] __wait_on_bit+0x5f/0x90 [] ? sync_buffer+0x0/0x50 [] out_of_line_wait_on_bit+0x78/0x90 [] ? wake_bit_function+0x0/0x50 [] ? bit_waitqueue+0x17/0xd0 [] __wait_on_buffer+0x26/0x30 [] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2] [] ? try_to_del_timer_sync+0x7b/0xe0 [] kjournald2+0xb8/0x220 [jbd2] [] ? autoremove_wake_function+0x0/0x40 [] ? kjournald2+0x0/0x220 [jbd2] [] kthread+0x9e/0xc0 [] child_rip+0xa/0x20 [] ? kthread+0x0/0xc0 [] ? child_rip+0x0/0x20 INFO: task master:1778 blocked for more than 120 seconds. Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. master D 0000000000000000 0 1778 1 0x00000080 ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001 ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 Call Trace: [] ? sync_buffer+0x0/0x50 [] io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 [] __wait_on_bit_lock+0x5a/0xc0 [] ? sync_buffer+0x0/0x50 [] out_of_line_wait_on_bit_lock+0x78/0x90 [] ? wake_bit_function+0x0/0x50 [] ? __find_get_block+0xa9/0x200 [] __lock_buffer+0x36/0x40 [] do_get_write_access+0x48b/0x520 [jbd2] [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [] __ext4_journal_get_write_access+0x38/0x80 [ext4] [] ext4_reserve_inode_write+0x73/0xa0 [ext4] [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] [] ? 
jbd2_journal_start+0xb5/0x100 [jbd2] [] ext4_dirty_inode+0x40/0x60 [ext4] [] __mark_inode_dirty+0x3b/0x1c0 [] file_update_time+0xf2/0x170 [] pipe_write+0x312/0x6b0 [] do_sync_write+0xfa/0x140 [] ? autoremove_wake_function+0x0/0x40 [] ? cp_new_stat+0xe4/0x100 [] ? read_tsc+0x9/0x20 [] ? ktime_get_ts+0xbf/0x100 [] ? security_file_permission+0x16/0x20 [] vfs_write+0xb8/0x1a0 [] ? fget_light_pos+0x16/0x50 [] sys_write+0x51/0xb0 [] ? __audit_syscall_exit+0x25e/0x290 [] system_call_fastpath+0x16/0x1b INFO: task pickup:1236 blocked for more than 120 seconds. Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. pickup D 0000000000000001 0 1236 1778 0x00000080 ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120 ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120 ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 Call Trace: [] ? __lru_cache_add+0x40/0x90 [] ? sync_buffer+0x0/0x50 [] io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 [] __wait_on_bit_lock+0x5a/0xc0 [] ? sync_buffer+0x0/0x50 [] out_of_line_wait_on_bit_lock+0x78/0x90 [] ? wake_bit_function+0x0/0x50 [] ? __find_get_block+0xa9/0x200 [] __lock_buffer+0x36/0x40 [] do_get_write_access+0x48b/0x520 [jbd2] [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] [] __ext4_journal_get_write_access+0x38/0x80 [ext4] [] ext4_reserve_inode_write+0x73/0xa0 [ext4] [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] [] ? jbd2_journal_start+0xb5/0x100 [jbd2] [] ext4_dirty_inode+0x40/0x60 [ext4] [] __mark_inode_dirty+0x3b/0x1c0 [] touch_atime+0x195/0x1a0 [] pipe_read+0x3e4/0x4d0 [] do_sync_read+0xfa/0x140 [] ? ep_send_events_proc+0x0/0x110 [] ? autoremove_wake_function+0x0/0x40 [] ? security_file_permission+0x16/0x20 [] vfs_read+0xb5/0x1a0 [] ? fget_light_pos+0x16/0x50 [] sys_read+0x51/0xb0 [] ? __audit_syscall_exit+0x25e/0x290 [] system_call_fastpath+0x16/0x1b sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680 sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680 sd 2:0:0:0: [sda] SCSI device reset on scsi2:0 If I just repair systems with that in their runtime history I should be on target for any concerns. Thanks for the responses... I've never really had fsck fail to correct errors when run manually. I have had the touch /forcefsck && reboot option decide that a fix was too risky and refuse to do it. The manual run would then fix it. Typically booting single user mode was enough to sort it out. If the problem disk was the root fs, then rescue media was the solution. We did have an iscsi array reboot which caused the filesystem to go read-only and at the time, we ran fsck -n to check for any errors. We did get a few errors of the type that you'd expect from a filesystem that is mounted, but not any inode or bitmap errors. We also had a hyper-v vm get in a wedged state because the backup mechanism called the filesystem freeze (fsfreeze) and then the backup software crashed and never unfroze the filesystem. We had to update the backup software and the hyper-v drivers for that. The only time I couldn't get fsck to behave was when a couple of systems had faulty RAM. In those cases the filesystem corruption was severe and it was easier to replace memory and reimage/restore from backups. So, I don't think fsck is showing false positives. You should be able to clear the errors with a manual fsck and I would definitely be concerned that a number of systems were showing fs errors. 
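For the manual run the filesystem needs to be unmounted first (single user mode is usually enough for something like /var; for the root filesystem boot from rescue media), and the rough sequence would be something like:

# umount /var
# e2fsck -f /dev/sda2

or, if you only want the safe automatic repairs applied without prompting:

# e2fsck -fp /dev/sda2

with /dev/sda2 standing in for whatever device actually backs the filesystem being checked.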
If you can't get the manual fsck to fix all of the errors, it might be worth opening a support ticket with RedHat. Hugh From bsawyers at vt.edu Fri Dec 22 16:30:52 2017 From: bsawyers at vt.edu (Brandon Sawyers) Date: Fri, 22 Dec 2017 16:30:52 +0000 Subject: [rhelv6-list] Unsubscribe In-Reply-To: References: Message-ID: On Fri, Dec 22, 2017, 11:29 Brown, Hugh M wrote: > Response at bottom > > -----Original Message----- > From: rhelv6-list-bounces at redhat.com [mailto: > rhelv6-list-bounces at redhat.com] On Behalf Of francis picabia > Sent: Thursday, December 21, 2017 9:47 AM > To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list < > rhelv6-list at redhat.com> > Subject: Re: [rhelv6-list] fsck -n always showing errors > > Thanks for the replies... > > > OK, I was expecting there must be some sort of false positive going on. > > For the system I listed here, those are not persistent errors. > > > However there is one which does show the same orphaned inode numbers > > on each run, so this is likely real. > > # fsck -n /var > fsck from util-linux-ng 2.17.2 > e2fsck 1.41.12 (17-May-2010) > Warning! /dev/sda2 is mounted. > Warning: skipping journal recovery because doing a read-only filesystem > check. > /dev/sda2 contains a file system with errors, check forced. > Pass 1: Checking inodes, blocks, and sizes Deleted inode 1059654 has zero > dtime. Fix? no > > Inodes that were part of a corrupted orphan linked list found. Fix? no > > Inode 1061014 was part of the orphaned inode list. IGNORED. > Inode 1061275 was part of the orphaned inode list. IGNORED. > Pass 2: Checking directory structure > Pass 3: Checking directory connectivity > Pass 4: Checking reference counts > Pass 5: Checking group summary information Block bitmap differences: > -124293 -130887 -4244999 -4285460 -4979711 -4984408 -4989489 -7052754 > -7052847 -7053693 -7069384 -7069539 -7069657 -7069788 -7074507 > -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no > > Free blocks count wrong (6918236, counted=5214069). > Fix? no > > Inode bitmap differences: -1059654 -1061014 -1061275 Fix? no > > Free inodes count wrong (1966010, counted=1878618). > Fix? no > > > /dev/sda2: ********** WARNING: Filesystem still has errors ********** > > /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 > blocks > > > dmesg shows it had some scsi issues. I suspect the scsi error > > is triggered by operation of VDP backup, which freezes the system > > for a second when completing the backup snapshot. > > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0 sd 2:0:0:0: [sda] > task abort on host 2, ffff880036e61ac0 sd 2:0:0:0: [sda] task abort on host > 2, ffff880036e614c0 sd 2:0:0:0: [sda] task abort on host 2, > ffff880036e61cc0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0 > sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0 sd 2:0:0:0: [sda] > task abort on host 2, ffff880036e616c0 sd 2:0:0:0: [sda] task abort on host > 2, ffff880036e615c0 sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0 > INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds. > Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > jbd2/sda2-8 D 0000000000000000 0 752 2 0x00000000 > ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb > ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f > ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8 Call > Trace: > [] ? scsi_request_fn+0xdb/0x750 [] ? 
> read_tsc+0x9/0x20 [] ? ktime_get_ts+0xbf/0x100 > [] ? sync_buffer+0x0/0x50 [] > io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 > [] __wait_on_bit+0x5f/0x90 [] ? > sync_buffer+0x0/0x50 [] > out_of_line_wait_on_bit+0x78/0x90 [] ? > wake_bit_function+0x0/0x50 [] ? bit_waitqueue+0x17/0xd0 > [] __wait_on_buffer+0x26/0x30 [] > jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2] [] ? > try_to_del_timer_sync+0x7b/0xe0 [] kjournald2+0xb8/0x220 > [jbd2] [] ? autoremove_wake_function+0x0/0x40 > [] ? kjournald2+0x0/0x220 [jbd2] [] > kthread+0x9e/0xc0 [] child_rip+0xa/0x20 > [] ? kthread+0x0/0xc0 [ 280>] ? child_rip+0x0/0x20 > INFO: task master:1778 blocked for more than 120 seconds. > Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > master D 0000000000000000 0 1778 1 0x00000080 > ffff8800ba0cb948 0000000000000082 0000000000000000 ffff88000003e460 > 00000037ffffffc8 0000004100000000 001744a7cc279bbf 0000000000000001 > ffff8800ba0c8000 00000002863b16d4 ffff880037a55068 ffff8800ba0cbfd8 Call > Trace: > [] ? sync_buffer+0x0/0x50 [] > io_schedule+0x73/0xc0 [] sync_buffer+0x40/0x50 > [] __wait_on_bit_lock+0x5a/0xc0 [] ? > sync_buffer+0x0/0x50 [] > out_of_line_wait_on_bit_lock+0x78/0x90 > [] ? wake_bit_function+0x0/0x50 [] ? > __find_get_block+0xa9/0x200 [] __lock_buffer+0x36/0x40 > [] do_get_write_access+0x48b/0x520 [jbd2] > [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] > [] __ext4_journal_get_write_access+0x38/0x80 [ext4] > [] ext4_reserve_inode_write+0x73/0xa0 [ext4] > [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] > [] ? jbd2_journal_start+0xb5/0x100 [jbd2] > [] ext4_dirty_inode+0x40/0x60 [ext4] > [] __mark_inode_dirty+0x3b/0x1c0 [] > file_update_time+0xf2/0x170 [] pipe_write+0x312/0x6b0 > [] do_sync_write+0xfa/0x140 [] ? > autoremove_wake_function+0x0/0x40 [] ? > cp_new_stat+0xe4/0x100 [] ? read_tsc+0x9/0x20 > [] ? ktime_get_ts+0xbf/0x100 [ e06>] ? security_file_permission+0x16/0x20 > [] vfs_write+0xb8/0x1a0 [] ? > fget_light_pos+0x16/0x50 [] sys_write+0x51/0xb0 > [] ? __audit_syscall_exit+0x25e/0x290 > [] system_call_fastpath+0x16/0x1b > INFO: task pickup:1236 blocked for more than 120 seconds. > Not tainted 2.6.32-696.3.2.el6.x86_64 #1 "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > pickup D 0000000000000001 0 1236 1778 0x00000080 > ffff880024c6f968 0000000000000086 0000000000000000 ffffea00019e4120 > ffff880024c6f8e8 ffffffff811456e0 001744a7cc27fe9e ffffea00019e4120 > ffff8800117ab4a8 00000002863b1637 ffff88003738d068 ffff880024c6ffd8 Call > Trace: > [] ? __lru_cache_add+0x40/0x90 [] ? > sync_buffer+0x0/0x50 [] io_schedule+0x73/0xc0 > [] sync_buffer+0x40/0x50 [] > __wait_on_bit_lock+0x5a/0xc0 [] ? sync_buffer+0x0/0x50 > [] out_of_line_wait_on_bit_lock+0x78/0x90 > [] ? wake_bit_function+0x0/0x50 [] ? > __find_get_block+0xa9/0x200 [] __lock_buffer+0x36/0x40 > [] do_get_write_access+0x48b/0x520 [jbd2] > [] jbd2_journal_get_write_access+0x31/0x50 [jbd2] > [] __ext4_journal_get_write_access+0x38/0x80 [ext4] > [] ext4_reserve_inode_write+0x73/0xa0 [ext4] > [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4] > [] ? jbd2_journal_start+0xb5/0x100 [jbd2] > [] ext4_dirty_inode+0x40/0x60 [ext4] > [] __mark_inode_dirty+0x3b/0x1c0 [] > touch_atime+0x195/0x1a0 [] pipe_read+0x3e4/0x4d0 > [] do_sync_read+0xfa/0x140 [] ? > ep_send_events_proc+0x0/0x110 [] ? > autoremove_wake_function+0x0/0x40 [] ? > security_file_permission+0x16/0x20 > [] vfs_read+0xb5/0x1a0 [] ? > fget_light_pos+0x16/0x50 [] sys_read+0x51/0xb0 > [] ? 
__audit_syscall_exit+0x25e/0x290 > [] system_call_fastpath+0x16/0x1b sd 2:0:0:0: [sda] task > abort on host 2, ffff880036d7a680 sd 2:0:0:0: [sda] Failed to get > completion for aborted cmd ffff880036d7a680 sd 2:0:0:0: [sda] SCSI device > reset on scsi2:0 > > > If I just repair systems with that in their runtime history I should be on > target for any concerns. > > > Thanks for the responses... > > > > > I've never really had fsck fail to correct errors when run manually. I > have had the touch /forcefsck && reboot option decide that a fix was too > risky and refuse to do it. The manual run would then fix it. Typically > booting single user mode was enough to sort it out. If the problem disk was > the root fs, then rescue media was the solution. > > We did have an iscsi array reboot which caused the filesystem to go > read-only and at the time, we ran fsck -n to check for any errors. We did > get a few errors of the type that you'd expect from a filesystem that is > mounted, but not any inode or bitmap errors. > > We also had a hyper-v vm get in a wedged state because the backup > mechanism called the filesystem freeze (fsfreeze) and then the backup > software crashed and never unfroze the filesystem. We had to update the > backup software and the hyper-v drivers for that. > > The only time I couldn't get fsck to behave was when a couple of systems > had faulty RAM. In those cases the filesystem corruption was severe and it > was easier to replace memory and reimage/restore from backups. > > So, I don't think fsck is showing false positives. You should be able to > clear the errors with a manual fsck and I would definitely be concerned > that a number of systems were showing fs errors. > > If you can't get the manual fsck to fix all of the errors, it might be > worth opening a support ticket with RedHat. > > Hugh > > > > _______________________________________________ > rhelv6-list mailing list > rhelv6-list at redhat.com > https://www.redhat.com/mailman/listinfo/rhelv6-list > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fpicabia at gmail.com Fri Dec 22 19:43:14 2017 From: fpicabia at gmail.com (francis picabia) Date: Fri, 22 Dec 2017 15:43:14 -0400 Subject: [rhelv6-list] fsck -n always showing errors In-Reply-To: References: Message-ID: On Fri, Dec 22, 2017 at 12:29 PM, Brown, Hugh M wrote: > Response at bottom > > -----Original Message----- > From: rhelv6-list-bounces at redhat.com [mailto:rhelv6-list-bounces@ > redhat.com] On Behalf Of francis picabia > Sent: Thursday, December 21, 2017 9:47 AM > To: Red Hat Enterprise Linux 6 (Santiago) discussion mailing-list < > rhelv6-list at redhat.com> > Subject: Re: [rhelv6-list] fsck -n always showing errors > > Thanks for the replies... > > > OK, I was expecting there must be some sort of false positive going on. > > For the system I listed here, those are not persistent errors. > > > However there is one which does show the same orphaned inode numbers > > on each run, so this is likely real. > > # fsck -n /var > fsck from util-linux-ng 2.17.2 > e2fsck 1.41.12 (17-May-2010) > Warning! /dev/sda2 is mounted. > Warning: skipping journal recovery because doing a read-only filesystem > check. > /dev/sda2 contains a file system with errors, check forced. > Pass 1: Checking inodes, blocks, and sizes Deleted inode 1059654 has zero > dtime. Fix? no > > Inodes that were part of a corrupted orphan linked list found. Fix? no > > Inode 1061014 was part of the orphaned inode list. IGNORED. 
> Inode 1061275 was part of the orphaned inode list. IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Block bitmap differences: -124293 -130887 -4244999 -4285460 -4979711
> -4984408 -4989489 -7052754 -7052847 -7053693 -7069384 -7069539 -7069657
> -7069788 -7074507 -(7095835--7095839) -7096847 -7097195 -9626336 Fix? no
>
> Free blocks count wrong (6918236, counted=5214069).
> Fix? no
>
> Inode bitmap differences: -1059654 -1061014 -1061275 Fix? no
>
> Free inodes count wrong (1966010, counted=1878618).
> Fix? no
>
> /dev/sda2: ********** WARNING: Filesystem still has errors **********
>
> /dev/sda2: 598086/2564096 files (1.5% non-contiguous), 3321764/10240000 blocks
>
> dmesg shows it had some scsi issues. I suspect the scsi error
> is triggered by operation of VDP backup, which freezes the system
> for a second when completing the backup snapshot.
>
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e618c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61ac0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e614c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61cc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e61dc0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e617c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e616c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e615c0
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036e613c0
> INFO: task jbd2/sda2-8:752 blocked for more than 120 seconds.
> Not tainted 2.6.32-696.3.2.el6.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> jbd2/sda2-8 D 0000000000000000 0 752 2 0x00000000
> ffff880037ac7c20 0000000000000046 ffff880037ac7bd0 ffffffff813a27eb
> ffff880037ac7b80 ffffffff81014b39 ffff880037ac7bd0 ffffffff810b2a4f
> ffff880036c44138 0000000000000000 ffff880037a69068 ffff880037ac7fd8
> Call Trace:
> [] ? scsi_request_fn+0xdb/0x750
> [] ? read_tsc+0x9/0x20
> [] ? ktime_get_ts+0xbf/0x100
> [] ? sync_buffer+0x0/0x50
> [] io_schedule+0x73/0xc0
> [] sync_buffer+0x40/0x50
> [] __wait_on_bit+0x5f/0x90
> [] ? sync_buffer+0x0/0x50
> [] out_of_line_wait_on_bit+0x78/0x90
> [] ? wake_bit_function+0x0/0x50
> [] ? bit_waitqueue+0x17/0xd0
> [] __wait_on_buffer+0x26/0x30
> [] jbd2_journal_commit_transaction+0xaa6/0x14f0 [jbd2]
> [] ? try_to_del_timer_sync+0x7b/0xe0
> [] kjournald2+0xb8/0x220 [jbd2]
> [] ? autoremove_wake_function+0x0/0x40
> [] ? kjournald2+0x0/0x220 [jbd2]
> [] kthread+0x9e/0xc0
> [] child_rip+0xa/0x20
> [] ? kthread+0x0/0xc0
> [ 280>] ? child_rip+0x0/0x20
> [hung task traces for master:1778 and pickup:1236 trimmed; identical to the ones quoted earlier in the thread]
> sd 2:0:0:0: [sda] task abort on host 2, ffff880036d7a680
> sd 2:0:0:0: [sda] Failed to get completion for aborted cmd ffff880036d7a680
> sd 2:0:0:0: [sda] SCSI device reset on scsi2:0
>
> If I just repair systems with that in their runtime history I should be on
> target for any concerns.
>
> Thanks for the responses...
>
> I've never really had fsck fail to correct errors when run manually. I have
> had the touch /forcefsck && reboot option decide that a fix was too risky
> and refuse to do it. The manual run would then fix it. Typically booting
> single user mode was enough to sort it out. If the problem disk was the
> root fs, then rescue media was the solution.
>
> We did have an iscsi array reboot which caused the filesystem to go
> read-only and at the time, we ran fsck -n to check for any errors. We did
> get a few errors of the type that you'd expect from a filesystem that is
> mounted, but not any inode or bitmap errors.
>
> We also had a hyper-v vm get in a wedged state because the backup
> mechanism called the filesystem freeze (fsfreeze) and then the backup
> software crashed and never unfroze the filesystem. We had to update the
> backup software and the hyper-v drivers for that.
>
> The only time I couldn't get fsck to behave was when a couple of systems
> had faulty RAM. In those cases the filesystem corruption was severe and it
> was easier to replace memory and reimage/restore from backups.
>
> So, I don't think fsck is showing false positives. You should be able to
> clear the errors with a manual fsck and I would definitely be concerned
> that a number of systems were showing fs errors.
>
> If you can't get the manual fsck to fix all of the errors, it might be
> worth opening a support ticket with RedHat.
>
> Hugh

This topic has a degree of "Your Mileage May Vary". Yes, some file system
problems with real physical disk errors will be difficult or sometimes even
impossible to recover from. It depends on how serious the flaws are. If it
is only a power loss situation, then the journal replay should do the trick.
If it is the media, a controller, or other hardware causing the
interruptions, it can be anything from a 4 to a 9 on the Richter scale -
maybe just some loss in the log files, or database files might be corrupted.

I say fsck is showing false positives because it is doing its evaluation
while the file system is changing. Minor items like wrong free block counts
or a zero dtime on a deleted inode are typical of looking at a file system
while writes are in flight. If you doubt it, try fsck -n /var on a system
running an active website or database and see for yourself what it reports.
I can tell you that on almost every Redhat system we checked, /var was not
clean. I'm talking about production servers and the like, not a relatively
quiet desktop system.
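
For anyone who wants to separate that read-only-check noise from real
damage, something along these lines should work. This is only a rough,
untested sketch; it assumes the filesystem is a plain ext4 partition like
/dev/sda2 and that kernel messages land in /var/log/messages, so adjust the
device and log paths for your own layout.

# 1. Has the kernel itself ever flagged the filesystem with errors?
#    "clean" here, while fsck -n still complains, points at transient noise
#    from checking a mounted, changing filesystem.
tune2fs -l /dev/sda2 | grep -i 'filesystem state'

# 2. Look for the events that actually matter: SCSI aborts/resets, ext4
#    errors, and the kernel remounting a filesystem read-only.
grep -E 'task abort on host|SCSI device reset|EXT4-fs error|Remounting filesystem read-only' /var/log/messages

# 3. Run the read-only check twice, a few minutes apart, and compare.
#    Complaints that change between runs are likely in-flight writes;
#    the same orphaned inodes showing up every time are more likely real.
e2fsck -n /dev/sda2 > /tmp/e2fsck.1 2>&1
sleep 300
e2fsck -n /dev/sda2 > /tmp/e2fsck.2 2>&1
diff /tmp/e2fsck.1 /tmp/e2fsck.2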
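
As for repairing just the systems with that SCSI history in their runtime
logs, a quick survey across the fleet is probably enough to build the list
of candidates for a scheduled offline fsck. Again only a sketch, assuming
root ssh access; the hostnames are made up and should be replaced with your
own inventory.

#!/bin/sh
# Hypothetical host list; substitute your own inventory.
for h in vm01 vm02 vm03; do
    echo "== $h =="
    # Any hit means the VM logged SCSI aborts/resets or ext4 errors at some
    # point and is worth booking for an offline fsck.
    ssh root@"$h" "grep -lE 'task abort on host|SCSI device reset|EXT4-fs error' /var/log/messages* 2>/dev/null"
done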