<html> <body> At 07:00 PM 10/12/2005, Michael Loftis wrote: <blockquote type=cite class=cite cite="">Both of these sound more like RAID problems, not LVM. What sort of RAID are you using? MD? If not MD what RAID controller are you using?</blockquote>
Both of these failures are on Redhat ES 4.1 systems using MD. Both are in testing prior to Oracle installs.<br><br>
The second system is the x86 system that loops through a reboot without giving me access to the problem file system to fsck:
<pre>
/dev/VolGroup00/LogVol02 UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY
(i.e. without -a or -p options)
...
Give root password for maintenance
(or type Control-D to continue)
&lt;and then reboots in any case&gt;
</pre>
It is quite a simple install with two 36GB SCSI disks. The disks are mirrored as two RAID-1 pairs: one for /boot and one for VolGroup00, which holds LogVol02 for /, LogVol01 for /var, and LogVol00 for swap. In the kickstart syntax used to build the system:
<pre>
clearpart --all --initlabel
part raid.10 --size=1024 --ondisk=sda --asprimary
part raid.11 --size=1024 --ondisk=sdb --asprimary
part raid.20 --size=1024 --grow --ondisk=sda
part raid.21 --size=1024 --grow --ondisk=sdb
raid /boot --fstype ext2 --level=RAID1 raid.10 raid.11
raid pv.1000 --level=RAID1 raid.20 raid.21
volgroup VolGroup00 --pesize=32768 pv.1000
logvol / --fstype ext3 --name=LogVol02 --vgname=VolGroup00 --size=10240 --grow
logvol /var --fstype ext3 --name=LogVol01 --vgname=VolGroup00 --size=8192
logvol swap --fstype swap --name=LogVol00 --vgname=VolGroup00 --size=2048
</pre>
While it says to run fsck manually, when I bring up the linux rescue system from the CD (Redhat ES 4.1) there is no /dev/VolGroup00/LogVol02 file system to run fsck on. Apparently the LVM layer hasn't made it available? What next?<br><br>
While you suggest that these sound like RAID problems, I've been using MD RAID on many systems (15+) for 3-4 years now, mostly on Redhat ES 2.1 systems, without any problems of this sort.
During that time I've had numerous disk failures (12-15) and even a controller failure, and I've always been able to recover without even taking systems out of production. I've never had a problem like this where I couldn't recover a file system or even boot a system - though I could deal with that.<br><br>
These problems might be related to the combination of LVM over MD RAID, but that's the way I need to run these systems. If I have to give up either MD RAID or LVM, at this time I choose to give up LVM - "half baked".<br><br>
My problem could also be a result of some ignorance on my part about LVM. That's why I'm posting these messages. I'd be delighted if somebody would say something to the effect of, "Didn't you know that you can use xyz to make that logical volume visible and then run fsck on it?"<br><br>
In terms of replicating such a problem for testing and correcting, re:<br><br>
At 04:18 AM 10/13/2005, Robin Green wrote: <blockquote type=cite class=cite cite="">Did you file a bug about this? It's rather hard to fix bugs if people don't file reproducible test cases in the relevant bug database.</blockquote>
It is indeed 'hard' as you say. In the above case there were two hard disk failures (/dev/sdb) that precipitated the problem. After the first disk failure I put in a replacement disk and apparently synchronized the RAID-1 pair successfully. However, about half a day later that replacement /dev/sdb also failed (I often use rather old disks for early development projects like this). It was after that second failure that I was unable to boot the system off /dev/sda, an apparently still fully functional disk (I tested it both from another system and of course from the linux rescue system from the CD). I am stuck unable to fsck the / file system that's nominally in /dev/VolGroup00/LogVol02, as above.<br><br>
Of course I have the disk, so I can replicate the problem at will, as evidenced by the final state of that disk.
I could dd the 36GB of that disk to a file that I could make available (e.g. on the Web) so that somebody else could copy it to an equivalent disk and replicate the problem, or I could even send that disk to somebody who was serious about working on the problem - assuming I could get security approval to do so. There wasn't much relevant on that disk when the problem occurred. I'm about out of ideas for working on it.<br><br>
Here's what fdisk says about that disk (viewed from an essentially identically configured system, which is of course working), in case anybody would like more details about its configuration when considering the dd proposal above:
<pre>
[root@helpb ~]# fdisk /dev/sdb

The number of cylinders for this disk is set to 4462.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sdb: 36.7 GB, 36703918080 bytes
255 heads, 63 sectors/track, 4462 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1         131     1052226   fd  Linux raid autodetect
/dev/sdb2             132        4462    34788757+  fd  Linux raid autodetect

Command (m for help): q

[root@helpb ~]#
</pre>
When you refer to filing "reproducible test cases in the relevant bug database", what do you suggest I do in this case? My experience with such bug databases has been poor. To me they have mostly looked like black holes - though I admit that's with a small number of experiences (~5) and there are some exceptions (e.g. Sun Microsystems seems to follow up pretty well). Still, I don't consider it a good use of my time or particularly useful for LVM. The basic problem is: how do I get at that logical volume file system to work on it, e.g. to recover it?<br><br>
The other apparent LVM failure that I'm dealing with is a bit more problematic to replicate.
It's similar in some ways (the base partitioning of /dev/sda and /dev/sdb is similar, but on 146GB SCSI disks). It also has a RAID-10 configuration (mentioned earlier in the sysinit bug) on four other disks. However, the RAID-10 just holds a data logical volume that the system can, in principle, come up without. I think I need to do a bit more work on that system trying to recover it before I send even more email about it. It's also a 64-bit system, which might complicate things a bit. I was hoping to get lucky and perhaps have somebody recognize its symptoms:
<pre>
...
4 logical volume(s) in volume group "VolGroup00" now active
ERROR: failed in exec of defaults
ERROR: failed in exec of ext3
mount: error 2 mounting none
switchroot: mount failed: 23
ERROR: ext3 exited abnormally! (pid 284)
... &lt;three more similar to the above&gt;
kernel panic - not syncing: Attempted to kill init!
</pre>
and have some ideas on routes to pursue to try to recover that system. Again, it is a test system, but if I can't recover from problems on test systems I certainly don't want to run the same LVM software on our production systems. I consider myself lucky to have run into such problems while testing.<br><br>
I sent my initial message partly in the hope that somebody might have ideas I could use to try to recover these systems, and partly to share my experiences so others might better be able to evaluate whether they want to use LVM on their production systems. <x-sigsep></x-sigsep> --Jed <a href="http://www.nersc.gov/~jed/" eudora="autourl"> http://www.nersc.gov/~jed/</a></body> </html>