From phillips at redhat.com Fri Apr 1 00:15:48 2005 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 31 Mar 2005 19:15:48 -0500 Subject: [Linux-cluster] new dlm control/configuration In-Reply-To: <20050331082707.GA27996@redhat.com> References: <20050331071043.GD7190@redhat.com> <20050331074607.GL17350@marowsky-bree.de> <20050331082707.GA27996@redhat.com> Message-ID: <200503311915.48154.phillips@redhat.com> Hi Dave On Thursday 31 March 2005 03:27, David Teigland wrote: > ...the mechanism used to export the locking API to user space is > pretty inconsequential. We're doing reads/writes on a misc device at > the moment (used through libdlm of course.) Going through an fs > might be better but I'm not sure why. Please stick with the socket connection on the misc device. It is efficient and simple. If somebody wants to write a pseudo filesystem for it they can go through the socket. Regards, Daniel From phillips at redhat.com Fri Apr 1 00:21:02 2005 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 31 Mar 2005 19:21:02 -0500 Subject: [Linux-cluster] new dlm control/configuration In-Reply-To: <20050331071043.GD7190@redhat.com> References: <20050331071043.GD7190@redhat.com> Message-ID: <200503311921.02945.phillips@redhat.com> Hi Dave, On Thursday 31 March 2005 02:10, David Teigland wrote: > A new command line program, dlm_tool, can be used to set up the dlm > manually in which case it depends on no other software (much like > using dmsetup with device-mapper.) Then what would be wrong with calling it dlmsetup? Regards, Daniel From phillips at redhat.com Fri Apr 1 00:31:45 2005 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 31 Mar 2005 19:31:45 -0500 Subject: [Linux-cluster] new dlm control/configuration In-Reply-To: <20050331071043.GD7190@redhat.com> References: <20050331071043.GD7190@redhat.com> Message-ID: <200503311931.45491.phillips@redhat.com> Hi Dave, On Thursday 31 March 2005 02:10, David Teigland wrote: > dlm_tool to configure/control the dlm manually: > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/?cvsroot=clu >ster re: set_local [] I know that it's possible to have multiple ip addresses on a given node and that the nodeid is not necessarily the hostname. However, it would be very nice to default to this and only need to use the set_local command to specify something more exotic. Regards, Daniel From phillips at redhat.com Fri Apr 1 00:36:23 2005 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 31 Mar 2005 19:36:23 -0500 Subject: [Linux-cluster] new dlm control/configuration In-Reply-To: <20050331071043.GD7190@redhat.com> References: <20050331071043.GD7190@redhat.com> Message-ID: <200503311936.23400.phillips@redhat.com> Hi Dave, On Thursday 31 March 2005 02:10, David Teigland wrote: > dlm_tool to configure/control the dlm manually: > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/?cvsroot=clu >ster re: get_done For the most part, the purpose of each of these commands is clear from its name, but not this one. You could cure this by calling it "wait_result" or similar. Regards, Daniel From phillips at redhat.com Fri Apr 1 00:41:57 2005 From: phillips at redhat.com (Daniel Phillips) Date: Thu, 31 Mar 2005 19:41:57 -0500 Subject: [Linux-cluster] new dlm control/configuration In-Reply-To: <20050331071043.GD7190@redhat.com> References: <20050331071043.GD7190@redhat.com> Message-ID: <200503311941.57259.phillips@redhat.com> Hi Dave, On Thursday 31 March 2005 02:10, David Teigland wrote: > Hi Dave, > 3. 
On each node we first need to tell the dlm what the local IP > address and nodeid are: > > nodea> dlm_tool set_local 1 10.0.0.1 > nodeb> dlm_tool set_local 2 10.0.0.2 > nodec> dlm_tool set_local 3 10.0.0.3 But we have a sophisticated messaging system. Why do we have to tell all the local IP addresses to each node? Regards, Daniel From bojan at rexursive.com Fri Apr 1 02:49:54 2005 From: bojan at rexursive.com (Bojan Smojver) Date: Fri, 1 Apr 2005 12:49:54 +1000 Subject: [Linux-cluster] LOCK_USE_CLNT Message-ID: <20050401124954.x9akojwy880sk888@imp.rexursive.com> In the latest gfs-kernel tarball (gfs-kernel-2.6.9-28), there are still references to this undefined symbol (apparently removed from 2.6.9-rc4, file include/linux/fs.h of the kernel). Is this supposed to exist somewhere? GFS kernel stuff doesn't like to be compiled without it... -- Bojan From phillips at redhat.com Fri Apr 1 05:48:01 2005 From: phillips at redhat.com (Daniel Phillips) Date: Fri, 1 Apr 2005 00:48:01 -0500 Subject: [Linux-cluster] DDraid benchmarks (epilogue) In-Reply-To: <200503291016.49403.phillips@redhat.com> References: <200503141717.19595.phillips@redhat.com> <200503291016.49403.phillips@redhat.com> Message-ID: <200504010048.02097.phillips@redhat.com> I looked into the cause of the ddraid oops noted in the earlier benchmark posting. It turned out to be just the fact that nothing prevents the dm device from being removed while there is still deferred timer IO pending. I filled out the missing benchmark table entries by just not removing the device. the correct fix is probably to teach ddraid's destroy method to wait patiently until all the child events complete. Alternatively, we could think about a higher level dm mechanism that understands how to wait for pending events other than just IO transfers before calling the destroy method. Or I can just hack this into the destroy method for now and think about lifting it up into device mapper later. Anyway, the missing numbers are for all overhead enabled on the ddraid order 2, in other words, the most interesting numbers. The overheads in question are the parity calculations (calc) and the shared persistent dirty log (sync). We see that in this case ddraid finishes the tar test dead even with IO to the raw disk. But ddraid is doing more of course, it is running the dirty log, futzing with bio vectors and calculating parity on read and write. So the dirty log is very efficient, even in the lots-of-small-transfers case. In the nonfragmented IO case, ddraid does very well, as before. Even with the dirty logging, ddraid order 2 is more than twice as fast as a single raw disk. 
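As a rough sketch of that "wait patiently" fix: the destroy method could block on a count of outstanding child events until the deferred timer IO drains. The names here (struct ddraid, dd->pending, dd->wait, ddraid_dtr) are hypothetical illustrations, not the actual ddraid code:

#include <linux/device-mapper.h>
#include <linux/wait.h>
#include <linux/slab.h>
#include <asm/atomic.h>

struct ddraid {
	atomic_t pending;           /* deferred child events still outstanding */
	wait_queue_head_t wait;     /* init_waitqueue_head() in the constructor */
	/* ... rest of the target state ... */
};

/* called from the timer/IO completion path as each child event finishes */
static void ddraid_event_done(struct ddraid *dd)
{
	if (atomic_dec_and_test(&dd->pending))
		wake_up(&dd->wait);
}

/* device-mapper destroy method: don't tear down state while events remain */
static void ddraid_dtr(struct dm_target *ti)
{
	struct ddraid *dd = ti->private;

	wait_event(dd->wait, atomic_read(&dd->pending) == 0);
	kfree(dd);
}

The same counter-plus-wait pattern is presumably what a generic device-mapper hook would need if the wait were later lifted up out of the target.
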
-------------------- untar linux-2.6.11.3 -------------------- raw scsi disk process: real 48.994s user 45.526s sys 3.063s umount: real 3.084s user 0.002s sys 0.429s ddraid order 1, no calc, no sync process: real 49.942s user 46.328s sys 3.028s umount: real 2.034s user 0.005s sys 0.626s ddraid order 1, calc, no sync process: real 50.864s user 46.221s sys 3.195s umount: real 1.839s user 0.006s sys 1.099s ddraid order 1, calc, sync process: real 50.979s user 46.382s sys 3.222s umount: real 1.895s user 0.002s sys 0.531s ddraid order 2, no calc, no sync process: real 49.532s user 45.837s sys 3.145s umount: real 1.318s user 0.004s sys 0.718s ddraid order 2, calc, no sync process: real 49.742s user 45.527s sys 3.135s umount: real 1.625s user 0.004s sys 1.054s ddraid order 2, no calc, sync process: real 50.620s user 46.285s sys 3.122s umount: real 1.293s user 0.003s sys 1.103s ddraid order 2, calc, sync process: real 50.832s user 46.495s sys 3.084s umount: real 1.437s user 0.004s sys 0.787s --------------------------------- cp /zoo/linux-2.6.11.3.tar.bz2 /x --------------------------------- raw scsi disk process: real 0.258s user 0.008s sys 0.236s umount: real 1.019s user 0.003s sys 0.032s raw scsi disk (again) process: real 0.264s user 0.013s sys 0.237s umount: real 1.053s user 0.005s sys 0.029s raw scsi disk (again) process: real 0.267s user 0.018s sys 0.233s umount: real 1.019s user 0.006s sys 0.028s ddraid order 1, calc, no sync process: real 0.267s user 0.007s sys 0.243s umount: real 0.568s user 0.006s sys 0.250s ddraid order 1, no calc, sync process: real 0.267s user 0.011s sys 0.240s umount: real 0.608s user 0.002s sys 0.032s ddraid order 1, calc, sync process: real 0.265s user 0.008s sys 0.239s umount: real 0.596s user 0.004s sys 0.042s ddraid order 2, no calc, no sync process: real 0.266s user 0.013s sys 0.234s umount: real 0.381s user 0.004s sys 0.049s ddraid order 2, calc, no sync process: real 0.269s user 0.010s sys 0.239s umount: real 0.392s user 0.004s sys 0.201s ddraid order 2, no calc, sync process: real 0.261s user 0.004s sys 0.244s umount: real 0.437s user 0.003s sys 0.195s ddraid order 2, calc, sync process: real 0.266s user 0.009s sys 0.240s umount: real 0.441s user 0.007s sys 0.026s From patrick at tykepenguin.com Fri Apr 1 10:51:44 2005 From: patrick at tykepenguin.com (Patrick Caulfield) Date: Fri, 1 Apr 2005 11:51:44 +0100 Subject: [Linux-cluster] new dlm control/configuration In-Reply-To: <20050331215036.GE1334@ca-server1.us.oracle.com> References: <20050331071043.GD7190@redhat.com> <20050331074607.GL17350@marowsky-bree.de> <20050331082707.GA27996@redhat.com> <20050331083751.GB23452@tykepenguin.com> <20050331215036.GE1334@ca-server1.us.oracle.com> Message-ID: <20050401105143.GA8720@tykepenguin.com> On Thu, Mar 31, 2005 at 01:50:36PM -0800, Mark Fasheh wrote: > Well it's actually quite clean in ocfs2_dlmfs, part of that is likely > related to some design calls we made early on to simplify our userspace > locking. We don't do ranges (anywhere really), and we consider all userspace > lock requests to be synchronous. This does however result in a userspace API > which is extremely lightweight and dirt simple to use. > > mkdir gives you a new domain, files created within that directory correspond > to lock resource with the same name. Open O_RDONLY gets you a PR mode lock, > open RDWR gives you an EX mode lock. You can do NOQUEUE (trylock) ops with > O_NONBLOCK. Reads and writes to the file return and set the LVB accordingly. 
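Taking that description at face value, the whole cycle looks something like the sketch below. This is only an illustration under assumptions -- the mount point (/dlm here) and the use of O_CREAT to create the lock resource are guesses; only the flag-to-mode mapping (O_RDONLY = PR, O_RDWR = EX, O_NONBLOCK = trylock) comes from the description above:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	char lvb[64];
	int fd;

	/* "mkdir gives you a new domain" */
	mkdir("/dlm/mydomain", 0755);

	/* open RDWR takes an EX lock on resource "mylock";
	   adding O_NONBLOCK would make it a trylock */
	fd = open("/dlm/mydomain/mylock", O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	write(fd, "hello", 5);          /* writes set the LVB */
	lseek(fd, 0, SEEK_SET);
	read(fd, lvb, sizeof(lvb));     /* reads return the LVB */

	close(fd);                      /* drops the lock */
	return 0;
}
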
> > One can literally, create a domain, create locks within it and ship data via > the LVB all from a bash shell on my cluster nodes. > > I was able to write a trivial library wrapper (for those who don't want to > use shell for controlling dlm functionality) in about 600 lines. > --Mark > That's interesting, thanks. As far as our DLM is concerned it's a very small subset of the full functionality (so it would never replace the existing device interface) but I can see it might be useful. -- patrick From bastian at waldi.eu.org Fri Apr 1 14:20:31 2005 From: bastian at waldi.eu.org (Bastian Blank) Date: Fri, 1 Apr 2005 16:20:31 +0200 Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device identifier Message-ID: <20050401142031.GA21976@wavehammer.waldi.eu.org> Hi folks iddev currently only returns a human readable string of the device content. This makes it rather unusable in contexts where you need to have to do decitions on the content. Changes: - Add three enums which describe the content. - identify_device get a struct which contains the enums and the human readable text. - Get the block size. The struct also contains a field for the uuid, but it is not set yet. - Add xfs support. - Make the ext2/3 check differ between ext2 and ext3. I currently try to do some cleanups in lvm2/fsadm. This tool currently relies on an entry in /etc/fstab to get the filesystem type and a temporary mount to get the block size. With this changes I can just use iddev to gather the informations. Bastian -- Conquest is easy. Control is not. -- Kirk, "Mirror, Mirror", stardate unknown -------------- next part -------------- Index: lib/iddev.h =================================================================== --- lib/iddev.h (revision 413) +++ lib/iddev.h (working copy) @@ -16,19 +16,64 @@ /** + * device_info - + */ + +enum device_info_family +{ + DEVICE_INFO_UNDEFINED_FAMILY = 0, + DEVICE_INFO_CONTAINER, + DEVICE_INFO_FILESYSTEM, + DEVICE_INFO_SWAP, +}; + +enum device_info_type +{ + DEVICE_INFO_UNDEFINED_TYPE = 0, + DEVICE_INFO_CONTAINER_CCA, + DEVICE_INFO_CONTAINER_CIDEV, + DEVICE_INFO_CONTAINER_LVM1, + DEVICE_INFO_CONTAINER_LVM2, + DEVICE_INFO_CONTAINER_PARTITION, + DEVICE_INFO_CONTAINER_POOL, + DEVICE_INFO_FILESYSTEM_EXT23, + DEVICE_INFO_FILESYSTEM_GFS, + DEVICE_INFO_FILESYSTEM_REISERFS, + DEVICE_INFO_FILESYSTEM_XFS, +}; + +enum device_info_subtype +{ + DEVICE_INFO_UNDEFINED_SUBTYPE = 0, + DEVICE_INFO_CONTAINER_PARTITION_MSDOS, + DEVICE_INFO_FILESYSTEM_EXT2, + DEVICE_INFO_FILESYSTEM_EXT3, +}; + +struct device_info +{ + enum device_info_family family; + enum device_info_type type; + enum device_info_subtype subtype; + + char display[128]; + unsigned char uuid[16]; + size_t block_size; +}; + +/** * indentify_device - figure out what's on a device * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type + * @info: a buffer * * The offset of @fd will be changed by the function. * This routine will not write to this device. 
* * Returns: -1 on error (with errno set), 1 if unabled to identify, - * 0 if device identified (with @type set) + * 0 if device identified (with @info set) */ -int identify_device(int fd, char *type, unsigned type_len); +int identify_device(int fd, struct device_info *info); /** Index: lib/identify_device.c =================================================================== --- lib/identify_device.c (revision 413) +++ lib/identify_device.c (working copy) @@ -50,7 +50,7 @@ int main(int argc, char *argv[]) { int fd; - char buf[BUFSIZE]; + struct device_info info; uint64 bytes; int error; @@ -63,18 +63,18 @@ if (fd < 0) die("can't open %s: %s\n", argv[1], strerror(errno)); - error = identify_device(fd, buf, BUFSIZE); + error = identify_device(fd, &info); if (error < 0) die("error identifying the contents of %s: %s\n", argv[1], strerror(errno)); else if (error) - strcpy(buf, "unknown"); + strcpy(info.display, "unknown"); error = device_size(fd, &bytes); if (error < 0) die("error determining the size of %s: %s\n", argv[1], strerror(errno)); printf("%s:\n%-15s%s\n%-15s%"PRIu64"\n", - argv[1], " contents:", buf, " bytes:", bytes); + argv[1], " contents:", info.display, " bytes:", bytes); close(fd); Index: lib/iddev.c =================================================================== --- lib/iddev.c (revision 413) +++ lib/iddev.c (working copy) @@ -25,24 +25,37 @@ #include "iddev.h" +static void info_set_display(struct device_info *info, const char *display) +{ + snprintf(info->display, sizeof (info->display), display); +} +static inline void info_set(struct device_info *info, const enum device_info_family family, const enum device_info_type type, const enum device_info_subtype subtype, const char *display) +{ + info->family = family; + info->type = type; + info->subtype = subtype; + info_set_display(info, display); +} +static inline void info_set_container(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display) +{ + info_set(info, DEVICE_INFO_CONTAINER, type, subtype, display); +} +static inline void info_set_filesystem(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display) +{ + info_set(info, DEVICE_INFO_FILESYSTEM, type, subtype, display); +} +typedef int check(int fd, struct device_info *info); + /** * check_for_gfs - check to see if GFS is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * An EINVAL returned from lseek means that the device was too - * small -- at least on Linux. 
- * - * Returns: -1 on error (with errno set), 1 if not GFS, - * 0 if GFS found (with type set) */ -static int check_for_gfs(int fd, char *type, unsigned type_len) +static check check_for_gfs; +static int check_for_gfs(int fd, struct device_info *info) { unsigned char buf[512]; uint32 *p = (uint32 *)buf; @@ -66,7 +79,7 @@ if (osi_be32_to_cpu(*p) != 0x01161970 || osi_be32_to_cpu(*(p + 1)) != 1) return 1; - snprintf(type, type_len, "GFS filesystem"); + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_GFS, 0, "GFS filesystem"); return 0; } @@ -74,15 +87,10 @@ /** * check_for_pool - check to see if Pool is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not Pool, - * 0 if Pool found (with type set) */ -static int check_for_pool(int fd, char *type, unsigned type_len) +static check check_for_pool; +static int check_for_pool(int fd, struct device_info *info) { unsigned char buf[512]; uint64 *p = (uint64 *)buf; @@ -106,23 +114,18 @@ if (osi_be64_to_cpu(*p) != 0x11670) return 1; - snprintf(type, type_len, "Pool subdevice"); + info_set_container(info, DEVICE_INFO_CONTAINER_POOL, 0, "Pool subdevice"); return 0; } /** - * check_for_paritition - check to see if Partition is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not Partition, - * 0 if Partition found (with type set) + * check_for_msdos - check to see if Partition is on this device */ -static int check_for_partition(int fd, char *type, unsigned type_len) +static check check_for_partition_msdos; +static int check_for_partition_msdos(int fd, struct device_info *info) { unsigned char buf[512]; int error; @@ -145,29 +148,42 @@ if (buf[510] != 0x55 || buf[511] != 0xAA) return 1; - snprintf(type, type_len, "partition information"); + info_set_container(info, DEVICE_INFO_CONTAINER_PARTITION, DEVICE_INFO_CONTAINER_PARTITION_MSDOS, "MSDOS partition information"); return 0; } +enum +{ + BLOCK_SIZE_BITS = 10, + BLOCK_SIZE = (1 << BLOCK_SIZE_BITS), + EXT3_SUPER_MAGIC = 0xEF53, + EXT23_FEATURE_COMPAT_HAS_JOURNAL = 0x4, +}; +struct ext23_superblock +{ + uint32_t _r1[6]; /**< 0x00 - 0x14 */ + uint32_t s_log_block_size; /**< 0x18 */ + uint32_t _r2[7]; /**< 0x1c - 0x34 */ + uint16_t s_magic; /**< 0x38 */ + uint16_t s_state; /**< 0x3a */ + uint32_t _r3[8]; /**< 0x3c - 0x58 */ + uint32_t s_feature_compat; /**< 0x5c */ + uint32_t s_feature_incompat; /**< 0x60 */ + uint32_t s_feature_ro_compat; /**< 0x64 */ + uint8_t s_uuid[16]; /**< 0x68 - 0x77 */ +}; + /** * check_for_ext23 - check to see if EXT23 is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * An EINVAL returned from lseek means that the device was too - * small -- at least on Linux. 
- * - * Returns: -1 on error (with errno set), 1 if not EXT23, - * 0 if EXT23 found (with type set) */ -static int check_for_ext23(int fd, char *type, unsigned type_len) +static check check_for_ext23; +static int check_for_ext23(int fd, struct device_info *info) { unsigned char buf[512]; - uint16 *p = (uint16 *)buf; + struct ext23_superblock *p = (struct ext23_superblock *)buf; int error; error = lseek(fd, 1024, SEEK_SET); @@ -185,26 +201,78 @@ else if (error < 58) return 1; - if (osi_le16_to_cpu(p[28]) != 0xEF53) + if (osi_le16_to_cpu(p->s_magic) != EXT3_SUPER_MAGIC) return 1; - snprintf(type, type_len, "EXT2/3 filesystem"); + info->block_size = (BLOCK_SIZE << osi_le32_to_cpu(p->s_log_block_size)); + if (osi_le16_to_cpu(p->s_feature_compat) & EXT23_FEATURE_COMPAT_HAS_JOURNAL) + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT3, "EXT3 filesystem"); + else + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT2, "EXT2 filesystem"); + return 0; } +enum +{ + XFS_SB_MAGIC = 0x58465342, +}; + +struct xfs_superblock +{ + uint32_t sb_magicnum; + uint32_t sb_blocksize; + uint64_t sb_dblocks; + uint64_t sb_rblocks; + uint64_t sb_rextents; + uint8_t sb_uuid[16]; +}; + /** + * check_for_xfs - check to see if XFS is on this device + */ + +static check check_for_xfs; +static int check_for_xfs(int fd, struct device_info *info) +{ + unsigned char buf[512]; + struct xfs_superblock *p = (struct xfs_superblock *)buf; + int error; + + error = lseek(fd, 0, SEEK_SET); + if (error < 0) + return (errno == EINVAL) ? 1 : error; + else if (error != 0) + { + errno = EINVAL; + return -1; + } + + error = read(fd, buf, 512); + if (error < 0) + return error; + else if (error < 58) + return 1; + + if (osi_be32_to_cpu(p->sb_magicnum) != XFS_SB_MAGIC) + return 1; + + info->block_size = osi_be32_to_cpu(p->sb_blocksize); + + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_XFS, 0, "XFS filesystem"); + + return 0; +} + + +/** * check_for_swap - check to see if SWAP is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not SWAP, - * 0 if SWAP found (with type set) */ -static int check_for_swap(int fd, char *type, unsigned type_len) +static check check_for_swap; +static int check_for_swap(int fd, struct device_info *info) { unsigned char buf[8192]; int error; @@ -227,7 +295,7 @@ if (memcmp(buf + 4086, "SWAP-SPACE", 10) && memcmp(buf + 4086, "SWAPSPACE2", 10)) return 1; - snprintf(type, type_len, "swap device"); + info_set(info, DEVICE_INFO_SWAP, 0, 0, "swap device"); return 0; } @@ -235,15 +303,10 @@ /** * check_for_lvm1 - check to see if LVM1 is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not LVM1, - * 0 if LVM1 found (with type set) */ -static int check_for_lvm1(int fd, char *type, unsigned type_len) +static check check_for_lvm1; +static int check_for_lvm1(int fd, struct device_info *info) { unsigned char buf[512]; int error; @@ -266,7 +329,7 @@ if (buf[0] != 'H' || buf[1] != 'M') return 1; - snprintf(type, type_len, "lvm1 subdevice"); + info_set_container(info, DEVICE_INFO_CONTAINER_LVM1, 0, "LVM1 subdevice"); return 0; } @@ -274,15 +337,10 @@ /** * 
check_for_lvm2 - check to see if LVM2 is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not LVM2, - * 0 if LVM1 found (with type set) */ -static int check_for_lvm2(int fd, char *type, unsigned type_len) +static check check_for_lvm2; +static int check_for_lvm2(int fd, struct device_info *info) { unsigned char buf[512]; int error; @@ -315,7 +373,7 @@ if (strncmp(&buf[24], "LVM2 001", 8) != 0) continue; - snprintf(type, type_len, "lvm2 subdevice"); + info_set_container(info, DEVICE_INFO_CONTAINER_LVM2, 0, "LVM1 subdevice"); return 0; } @@ -326,15 +384,10 @@ /** * check_for_cidev - check to see if CIDEV is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not CIDEV, - * 0 if CIDEV found (with type set) */ -static int check_for_cidev(int fd, char *type, unsigned type_len) +static check check_for_cidev; +static int check_for_cidev(int fd, struct device_info *info) { unsigned char buf[512]; uint32 *p = (uint32 *)buf; @@ -358,7 +411,7 @@ if (osi_be32_to_cpu(*p) != 0x47465341) return 1; - snprintf(type, type_len, "CIDEV"); + info_set_container(info, DEVICE_INFO_CONTAINER_CIDEV, 0, "CIDEV"); return 0; } @@ -366,15 +419,10 @@ /** * check_for_cca - check to see if CCA is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not CCA, - * 0 if CCA found (with type set) */ -static int check_for_cca(int fd, char *type, unsigned type_len) +static check check_for_cca; +static int check_for_cca(int fd, struct device_info *info) { unsigned char buf[512]; uint32 *p = (uint32 *)buf; @@ -398,7 +446,7 @@ if (osi_be32_to_cpu(*p) != 0x122473) return 1; - snprintf(type, type_len, "CCA device"); + info_set_container(info, DEVICE_INFO_CONTAINER_CCA, 0, "CCA device"); return 0; } @@ -406,15 +454,10 @@ /** * check_for_reiserfs - check to see if reisterfs is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not reiserfs, - * 0 if CCA found (with type set) */ -static int check_for_reiserfs(int fd, char *type, unsigned type_len) +static check check_for_reiserfs; +static int check_for_reiserfs(int fd, struct device_info *info) { unsigned int pass; uint64 offset; @@ -444,7 +487,7 @@ strncmp(buf + 52, "ReIsEr2Fs", 9) == 0 || strncmp(buf + 52, "ReIsEr3Fs", 9) == 0) { - snprintf(type, type_len, "Reiserfs filesystem"); + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_REISERFS, 0, "ReiserFS filesystem"); return 0; } } @@ -453,69 +496,40 @@ } -/** - * identify_device - figure out what's on a device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * The offset of @fd will be changed by this function. - * This routine will not write to the device. 
- * - * Returns: -1 on error (with errno set), 1 if unabled to identify, - * 0 if device identified (with type set) - */ +static check *checks[] = +{ + check_for_partition_msdos, + check_for_pool, + check_for_lvm1, + check_for_lvm2, + check_for_cidev, + check_for_cca, + check_for_ext23, + check_for_gfs, + check_for_reiserfs, + check_for_xfs, + check_for_swap, +}; -int identify_device(int fd, char *type, unsigned type_len) +int identify_device(int fd, struct device_info *info) { - int error; + int i; - if (!type || !type_len) + if (!info) { errno = EINVAL; return -1; } - error = check_for_pool(fd, type, type_len); - if (error <= 0) - return error; + memset(info, sizeof (struct device_info), 0); - error = check_for_lvm1(fd, type, type_len); - if (error <= 0) - return error; + for (i = 0; i < sizeof (checks) / sizeof (*checks); ++i) + { + int error = checks[i](fd, info); + if (error <= 0) + return error; + } - error = check_for_lvm2(fd, type, type_len); - if(error <= 0) - return error; - - error = check_for_cidev(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_cca(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_gfs(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_ext23(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_reiserfs(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_swap(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_partition(fd, type, type_len); - if (error <= 0) - return error; - return 1; } -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: Digital signature URL: From agk at redhat.com Fri Apr 1 15:10:01 2005 From: agk at redhat.com (Alasdair G Kergon) Date: Fri, 1 Apr 2005 16:10:01 +0100 Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device identifier In-Reply-To: <20050401142031.GA21976@wavehammer.waldi.eu.org> References: <20050401142031.GA21976@wavehammer.waldi.eu.org> Message-ID: <20050401151001.GB14307@agk.surrey.redhat.com> On Fri, Apr 01, 2005 at 04:20:31PM +0200, Bastian Blank wrote: > I currently try to do some cleanups in lvm2/fsadm. This tool currently > relies on an entry in /etc/fstab to get the filesystem type and a > temporary mount to get the block size. With this changes I can just use > iddev to gather the informations. Also, if online and offline resizers are both available, fsadm should choose whichever is most appropriate according to whether the filesystem is already mounted or not. Alasdair -- agk at redhat.com From bastian at waldi.eu.org Fri Apr 1 16:02:10 2005 From: bastian at waldi.eu.org (Bastian Blank) Date: Fri, 1 Apr 2005 18:02:10 +0200 Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device identifier In-Reply-To: <20050401151001.GB14307@agk.surrey.redhat.com> References: <20050401142031.GA21976@wavehammer.waldi.eu.org> <20050401151001.GB14307@agk.surrey.redhat.com> Message-ID: <20050401160210.GA22564@wavehammer.waldi.eu.org> On Fri, Apr 01, 2005 at 04:10:01PM +0100, Alasdair G Kergon wrote: > Also, if online and offline resizers are both available, fsadm > should choose whichever is most appropriate according to whether > the filesystem is already mounted or not. Should be no real problem. Hmm, the two weeks are over and I don't got a statement to my patch. Bastian -- She won' go Warp 7, Cap'n! The batteries are dead! 
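For reference, the mounted-or-not decision Alasdair describes can be made by scanning the mount table. A minimal sketch follows; how fsadm will actually do it is an open question, and device paths may need canonicalising (dm-X versus /dev/mapper names) before comparing:

#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* return 1 if dev is currently mounted, 0 if not (or on error) */
static int device_is_mounted(const char *dev)
{
	FILE *f = setmntent("/proc/mounts", "r");
	struct mntent *m;
	int found = 0;

	if (!f)
		return 0;
	while ((m = getmntent(f)) != NULL) {
		if (strcmp(m->mnt_fsname, dev) == 0) {
			found = 1;
			break;
		}
	}
	endmntent(f);
	return found;
}

int main(int argc, char **argv)
{
	if (argc < 2)
		return 2;
	printf("%s\n", device_is_mounted(argv[1]) ?
	       "mounted: use online resizer" : "not mounted: use offline resizer");
	return 0;
}
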
-------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: Digital signature URL: From bastian at waldi.eu.org Fri Apr 1 16:07:03 2005 From: bastian at waldi.eu.org (Bastian Blank) Date: Fri, 1 Apr 2005 18:07:03 +0200 Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device identifier In-Reply-To: <20050401142031.GA21976@wavehammer.waldi.eu.org> References: <20050401142031.GA21976@wavehammer.waldi.eu.org> Message-ID: <20050401160703.GB22564@wavehammer.waldi.eu.org> Updated patch. Changes: - Use mmap. - Add the device size to the exported information. Bastian -- A princess should not be afraid -- not with a brave knight to protect her. -- McCoy, "Shore Leave", stardate 3025.3 -------------- next part -------------- === lib/iddev.h ================================================================== --- lib/iddev.h (/iddev/trunk) (revision 29) +++ lib/iddev.h (/iddev/local/branches/refactor) (revision 29) @@ -16,19 +16,65 @@ /** + * device_info - + */ + +enum device_info_family +{ + DEVICE_INFO_UNDEFINED_FAMILY = 0, + DEVICE_INFO_CONTAINER, + DEVICE_INFO_FILESYSTEM, + DEVICE_INFO_SWAP, +}; + +enum device_info_type +{ + DEVICE_INFO_UNDEFINED_TYPE = 0, + DEVICE_INFO_CONTAINER_CCA, + DEVICE_INFO_CONTAINER_CIDEV, + DEVICE_INFO_CONTAINER_LVM1, + DEVICE_INFO_CONTAINER_LVM2, + DEVICE_INFO_CONTAINER_PARTITION, + DEVICE_INFO_CONTAINER_POOL, + DEVICE_INFO_FILESYSTEM_EXT23, + DEVICE_INFO_FILESYSTEM_GFS, + DEVICE_INFO_FILESYSTEM_REISERFS, + DEVICE_INFO_FILESYSTEM_XFS, +}; + +enum device_info_subtype +{ + DEVICE_INFO_UNDEFINED_SUBTYPE = 0, + DEVICE_INFO_CONTAINER_PARTITION_MSDOS, + DEVICE_INFO_FILESYSTEM_EXT2, + DEVICE_INFO_FILESYSTEM_EXT3, +}; + +struct device_info +{ + enum device_info_family family; + enum device_info_type type; + enum device_info_subtype subtype; + + char display[128]; + unsigned char uuid[16]; + uint64_t device_size; + uint32_t block_size; +}; + +/** * indentify_device - figure out what's on a device * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type + * @info: a buffer * * The offset of @fd will be changed by the function. * This routine will not write to this device. 
* * Returns: -1 on error (with errno set), 1 if unabled to identify, - * 0 if device identified (with @type set) + * 0 if device identified (with @info set) */ -int identify_device(int fd, char *type, unsigned type_len); +int identify_device(int fd, struct device_info *info); /** @@ -39,7 +85,7 @@ * Returns: -1 on error (with errno set), 0 on success (with @bytes set) */ -int device_size(int fd, uint64 *bytes); +int device_size(int fd, uint64_t *bytes); #endif /* __IDDEV_DOT_H__ */ === lib/identify_device.c ================================================================== --- lib/identify_device.c (/iddev/trunk) (revision 29) +++ lib/identify_device.c (/iddev/local/branches/refactor) (revision 29) @@ -50,8 +50,8 @@ int main(int argc, char *argv[]) { int fd; - char buf[BUFSIZE]; - uint64 bytes; + struct device_info info; + const char *display; int error; prog_name = argv[0]; @@ -63,18 +63,16 @@ if (fd < 0) die("can't open %s: %s\n", argv[1], strerror(errno)); - error = identify_device(fd, buf, BUFSIZE); + error = identify_device(fd, &info); if (error < 0) die("error identifying the contents of %s: %s\n", argv[1], strerror(errno)); else if (error) - strcpy(buf, "unknown"); + display = "unknown"; + else + display = info.display; - error = device_size(fd, &bytes); - if (error < 0) - die("error determining the size of %s: %s\n", argv[1], strerror(errno)); - printf("%s:\n%-15s%s\n%-15s%"PRIu64"\n", - argv[1], " contents:", buf, " bytes:", bytes); + argv[1], " contents:", display, " bytes:", info.device_size); close(fd); === lib/iddev.c ================================================================== --- lib/iddev.c (/iddev/trunk) (revision 29) +++ lib/iddev.c (/iddev/local/branches/refactor) (revision 29) @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -25,48 +26,53 @@ #include "iddev.h" +static void info_set_display(struct device_info *info, const char *display) +{ + snprintf(info->display, sizeof (info->display), display); +} +static inline void info_set(struct device_info *info, const enum device_info_family family, const enum device_info_type type, const enum device_info_subtype subtype, const char *display) +{ + info->family = family; + info->type = type; + info->subtype = subtype; + info_set_display(info, display); +} +static inline void info_set_container(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display) +{ + info_set(info, DEVICE_INFO_CONTAINER, type, subtype, display); +} +static inline void info_set_filesystem(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display) +{ + info_set(info, DEVICE_INFO_FILESYSTEM, type, subtype, display); +} +typedef int check(const void *mem, size_t len, struct device_info *info); + /** * check_for_gfs - check to see if GFS is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * An EINVAL returned from lseek means that the device was too - * small -- at least on Linux. 
- * - * Returns: -1 on error (with errno set), 1 if not GFS, - * 0 if GFS found (with type set) */ -static int check_for_gfs(int fd, char *type, unsigned type_len) +enum { - unsigned char buf[512]; - uint32 *p = (uint32 *)buf; - int error; + GFS_OFFSET = 64*1024, + GFS_SB_SIZE = 512, +}; - error = lseek(fd, 65536, SEEK_SET); - if (error < 0) - return (errno == EINVAL) ? 1 : error; - else if (error != 65536) - { - errno = EINVAL; - return -1; - } +static check check_for_gfs; +static int check_for_gfs(const void *mem, size_t len, struct device_info *info) +{ + const uint32_t *p = (const uint32_t *)((const unsigned char *)mem + GFS_OFFSET); - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 8) + if (len < GFS_OFFSET + GFS_SB_SIZE) return 1; if (osi_be32_to_cpu(*p) != 0x01161970 || osi_be32_to_cpu(*(p + 1)) != 1) return 1; - snprintf(type, type_len, "GFS filesystem"); + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_GFS, 0, "GFS filesystem"); return 0; } @@ -74,199 +80,186 @@ /** * check_for_pool - check to see if Pool is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not Pool, - * 0 if Pool found (with type set) */ -static int check_for_pool(int fd, char *type, unsigned type_len) +enum { - unsigned char buf[512]; - uint64 *p = (uint64 *)buf; - int error; + POOL_SB_SIZE = 512, +}; - error = lseek(fd, 0, SEEK_SET); - if (error < 0) - return error; - else if (error != 0) - { - errno = EINVAL; - return -1; - } +static check check_for_pool; +static int check_for_pool(const void *mem, size_t len, struct device_info *info) +{ + const uint64_t *p = mem; - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 8) + if (len < POOL_SB_SIZE) return 1; if (osi_be64_to_cpu(*p) != 0x11670) return 1; - snprintf(type, type_len, "Pool subdevice"); + info_set_container(info, DEVICE_INFO_CONTAINER_POOL, 0, "Pool subdevice"); return 0; } /** - * check_for_paritition - check to see if Partition is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not Partition, - * 0 if Partition found (with type set) + * check_for_partition_msdos - check to see if Partition is on this device */ -static int check_for_partition(int fd, char *type, unsigned type_len) +enum { - unsigned char buf[512]; - int error; + PARTITION_MSDOS_SB_SIZE = 512, +}; - error = lseek(fd, 0, SEEK_SET); - if (error < 0) - return error; - else if (error != 0) - { - errno = EINVAL; - return -1; - } +static check check_for_partition_msdos; +static int check_for_partition_msdos(const void *mem, size_t len, struct device_info *info) +{ + const unsigned char *buf = mem; - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 512) + if (len < PARTITION_MSDOS_SB_SIZE) return 1; if (buf[510] != 0x55 || buf[511] != 0xAA) return 1; - snprintf(type, type_len, "partition information"); + info_set_container(info, DEVICE_INFO_CONTAINER_PARTITION, DEVICE_INFO_CONTAINER_PARTITION_MSDOS, "MSDOS partition information"); return 0; } +enum +{ + EXT23_OFFSET = 1024, + EXT23_SB_SIZE = 512, + EXT23_BLOCK_SIZE_BITS = 10, + EXT23_BLOCK_SIZE = (1 << EXT23_BLOCK_SIZE_BITS), + EXT23_SUPER_MAGIC = 
0xEF53, + EXT23_FEATURE_COMPAT_HAS_JOURNAL = 0x4, +}; +struct ext23_superblock +{ + uint32_t _r1[6]; /**< 0x00 - 0x14 */ + uint32_t s_log_block_size; /**< 0x18 */ + uint32_t _r2[7]; /**< 0x1c - 0x34 */ + uint16_t s_magic; /**< 0x38 */ + uint16_t s_state; /**< 0x3a */ + uint32_t _r3[8]; /**< 0x3c - 0x58 */ + uint32_t s_feature_compat; /**< 0x5c */ + uint32_t s_feature_incompat; /**< 0x60 */ + uint32_t s_feature_ro_compat; /**< 0x64 */ + uint8_t s_uuid[16]; /**< 0x68 - 0x77 */ +}; + /** * check_for_ext23 - check to see if EXT23 is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * An EINVAL returned from lseek means that the device was too - * small -- at least on Linux. - * - * Returns: -1 on error (with errno set), 1 if not EXT23, - * 0 if EXT23 found (with type set) */ -static int check_for_ext23(int fd, char *type, unsigned type_len) +static check check_for_ext23; +static int check_for_ext23(const void *mem, size_t len, struct device_info *info) { - unsigned char buf[512]; - uint16 *p = (uint16 *)buf; - int error; + const struct ext23_superblock *p = (const struct ext23_superblock *)((const unsigned char *)mem + EXT23_OFFSET); - error = lseek(fd, 1024, SEEK_SET); - if (error < 0) - return (errno == EINVAL) ? 1 : error; - else if (error != 1024) - { - errno = EINVAL; - return -1; - } + if (len < EXT23_OFFSET + EXT23_SB_SIZE) + return 1; - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 58) + if (osi_le16_to_cpu(p->s_magic) != EXT23_SUPER_MAGIC) return 1; - if (osi_le16_to_cpu(p[28]) != 0xEF53) + info->block_size = (EXT23_BLOCK_SIZE << osi_le32_to_cpu(p->s_log_block_size)); + + if (osi_le16_to_cpu(p->s_feature_compat) & EXT23_FEATURE_COMPAT_HAS_JOURNAL) + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT3, "EXT3 filesystem"); + else + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT2, "EXT2 filesystem"); + + return 0; +} + + +enum +{ + XFS_SB_SIZE = 512, + XFS_SB_MAGIC = 0x58465342, +}; + +struct xfs_superblock +{ + uint32_t sb_magicnum; + uint32_t sb_blocksize; + uint64_t sb_dblocks; + uint64_t sb_rblocks; + uint64_t sb_rextents; + uint8_t sb_uuid[16]; +}; + +/** + * check_for_xfs - check to see if XFS is on this device + */ + +static check check_for_xfs; +static int check_for_xfs(const void *mem, size_t len, struct device_info *info) +{ + const struct xfs_superblock *p = mem; + + if (len < XFS_SB_SIZE) return 1; - snprintf(type, type_len, "EXT2/3 filesystem"); + if (osi_be32_to_cpu(p->sb_magicnum) != XFS_SB_MAGIC) + return 1; + info->block_size = osi_be32_to_cpu(p->sb_blocksize); + + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_XFS, 0, "XFS filesystem"); + return 0; } /** * check_for_swap - check to see if SWAP is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not SWAP, - * 0 if SWAP found (with type set) */ -static int check_for_swap(int fd, char *type, unsigned type_len) +static check check_for_swap; +static int check_for_swap(const void *mem, size_t len, struct device_info *info) { - unsigned char buf[8192]; - int error; + const unsigned char *buf = mem; - error = lseek(fd, 0, SEEK_SET); - if (error < 0) - return error; - else if 
(error != 0) - { - errno = EINVAL; - return -1; - } - - error = read(fd, buf, 8192); - if (error < 0) - return error; - else if (error < 4096) + if (len < 8192) return 1; if (memcmp(buf + 4086, "SWAP-SPACE", 10) && memcmp(buf + 4086, "SWAPSPACE2", 10)) return 1; - snprintf(type, type_len, "swap device"); + info_set(info, DEVICE_INFO_SWAP, 0, 0, "swap device"); return 0; } +enum +{ + LVM1_SB_SIZE = 512, +}; + /** * check_for_lvm1 - check to see if LVM1 is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not LVM1, - * 0 if LVM1 found (with type set) */ -static int check_for_lvm1(int fd, char *type, unsigned type_len) +static check check_for_lvm1; +static int check_for_lvm1(const void *mem, size_t len, struct device_info *info) { - unsigned char buf[512]; - int error; + const unsigned char *buf = mem; - error = lseek(fd, 0, SEEK_SET); - if (error < 0) - return error; - else if (error != 0) - { - errno = EINVAL; - return -1; - } - - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 2) + if (len < LVM1_SB_SIZE) return 1; if (buf[0] != 'H' || buf[1] != 'M') return 1; - snprintf(type, type_len, "lvm1 subdevice"); + info_set_container(info, DEVICE_INFO_CONTAINER_LVM1, 0, "LVM1 subdevice"); return 0; } @@ -274,39 +267,22 @@ /** * check_for_lvm2 - check to see if LVM2 is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not LVM2, - * 0 if LVM1 found (with type set) */ -static int check_for_lvm2(int fd, char *type, unsigned type_len) +static check check_for_lvm2; +static int check_for_lvm2(const void *mem, size_t len, struct device_info *info) { - unsigned char buf[512]; - int error; int i; + if (len < 6 * 512) + return 1; + /* LVM 2 labels can start in sectors 1-4 */ for (i = 1; i < 5; i++) { - error = lseek(fd, 512 * i, SEEK_SET); - if (error < 0) - return (errno == EINVAL) ? 
1 : error; - else if (error != 512 * i) - { - errno = EINVAL; - return -1; - } + const unsigned char *buf = (const unsigned char *)mem + 512 * i; - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 32) - return 1; - if (strncmp(buf, "LABELONE", 8) != 0) continue; if (((uint64_t *)buf)[1] != i) @@ -315,7 +291,7 @@ if (strncmp(&buf[24], "LVM2 001", 8) != 0) continue; - snprintf(type, type_len, "lvm2 subdevice"); + info_set_container(info, DEVICE_INFO_CONTAINER_LVM2, 0, "LVM2 subdevice"); return 0; } @@ -324,127 +300,85 @@ } +enum +{ + CIDEV_SB_SIZE = 512, +}; + /** * check_for_cidev - check to see if CIDEV is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not CIDEV, - * 0 if CIDEV found (with type set) */ -static int check_for_cidev(int fd, char *type, unsigned type_len) +static check check_for_cidev; +static int check_for_cidev(const void *mem, size_t len, struct device_info *info) { - unsigned char buf[512]; - uint32 *p = (uint32 *)buf; - int error; + const uint32_t *p = mem; - error = lseek(fd, 0, SEEK_SET); - if (error < 0) - return error; - else if (error != 0) - { - errno = EINVAL; - return -1; - } - - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 4) + if (len < CIDEV_SB_SIZE) return 1; if (osi_be32_to_cpu(*p) != 0x47465341) return 1; - snprintf(type, type_len, "CIDEV"); + info_set_container(info, DEVICE_INFO_CONTAINER_CIDEV, 0, "CIDEV"); return 0; } +enum +{ + CCA_SB_SIZE = 512, +}; + /** * check_for_cca - check to see if CCA is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not CCA, - * 0 if CCA found (with type set) */ -static int check_for_cca(int fd, char *type, unsigned type_len) +static check check_for_cca; +static int check_for_cca(const void *mem, size_t len, struct device_info *info) { - unsigned char buf[512]; - uint32 *p = (uint32 *)buf; - int error; + const uint32_t *p = mem; - error = lseek(fd, 0, SEEK_SET); - if (error < 0) - return error; - else if (error != 0) - { - errno = EINVAL; - return -1; - } - - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 4) + if (len < CCA_SB_SIZE) return 1; if (osi_be32_to_cpu(*p) != 0x122473) return 1; - snprintf(type, type_len, "CCA device"); + info_set_container(info, DEVICE_INFO_CONTAINER_CCA, 0, "CCA device"); return 0; } +enum +{ + REISERFS_SB_SIZE = 65 * 1024, +}; + /** * check_for_reiserfs - check to see if reisterfs is on this device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * Returns: -1 on error (with errno set), 1 if not reiserfs, - * 0 if CCA found (with type set) */ -static int check_for_reiserfs(int fd, char *type, unsigned type_len) +static check check_for_reiserfs; +static int check_for_reiserfs(const void *mem, size_t len, struct device_info *info) { - unsigned int pass; - uint64 offset; - unsigned char buf[512]; - int error; + int pass; + if (len < REISERFS_SB_SIZE) + return 1; + for (pass = 0; pass < 2; pass++) { - offset = (pass) ? 65536 : 8192; + unsigned int offset = (pass) ? 
65536 : 8192; + const unsigned char *p = (const unsigned char *)mem + offset; - error = lseek(fd, offset, SEEK_SET); - if (error < 0) - return (errno == EINVAL) ? 1 : error; - else if (error != offset) + if (strncmp(p + 52, "ReIsErFs", 8) == 0 || + strncmp(p + 52, "ReIsEr2Fs", 9) == 0 || + strncmp(p + 52, "ReIsEr3Fs", 9) == 0) { - errno = EINVAL; - return -1; - } - - error = read(fd, buf, 512); - if (error < 0) - return error; - else if (error < 62) - return 1; - - if (strncmp(buf + 52, "ReIsErFs", 8) == 0 || - strncmp(buf + 52, "ReIsEr2Fs", 9) == 0 || - strncmp(buf + 52, "ReIsEr3Fs", 9) == 0) - { - snprintf(type, type_len, "Reiserfs filesystem"); + info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_REISERFS, 0, "ReiserFS filesystem"); return 0; } } @@ -453,69 +387,49 @@ } -/** - * identify_device - figure out what's on a device - * @fd: a file descriptor open on a device open for (at least) reading - * @type: a buffer that contains the type of filesystem - * @type_len: the amount of space pointed to by @type - * - * The offset of @fd will be changed by this function. - * This routine will not write to the device. - * - * Returns: -1 on error (with errno set), 1 if unabled to identify, - * 0 if device identified (with type set) - */ +static check *checks[] = +{ + check_for_partition_msdos, + check_for_pool, + check_for_lvm1, + check_for_lvm2, + check_for_cidev, + check_for_cca, + check_for_ext23, + check_for_gfs, + check_for_reiserfs, + check_for_xfs, + check_for_swap, +}; -int identify_device(int fd, char *type, unsigned type_len) +int identify_device(int fd, struct device_info *info) { - int error; + int i; + const void *mem; + size_t len; - if (!type || !type_len) + if (!info) { errno = EINVAL; return -1; } - error = check_for_pool(fd, type, type_len); - if (error <= 0) - return error; + memset(info, sizeof (struct device_info), 0); - error = check_for_lvm1(fd, type, type_len); - if (error <= 0) - return error; + if (device_size(fd, &info->device_size) < 0) + return -1; - error = check_for_lvm2(fd, type, type_len); - if(error <= 0) - return error; + len = info->device_size <= 256*1024 ? info->device_size : 256*1024; - error = check_for_cidev(fd, type, type_len); - if (error <= 0) - return error; + mem = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0); - error = check_for_cca(fd, type, type_len); - if (error <= 0) - return error; + for (i = 0; i < sizeof (checks) / sizeof (*checks); ++i) + { + int error = checks[i](mem, len, info); + if (error <= 0) + return error; + } - error = check_for_gfs(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_ext23(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_reiserfs(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_swap(fd, type, type_len); - if (error <= 0) - return error; - - error = check_for_partition(fd, type, type_len); - if (error <= 0) - return error; - return 1; } -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: Digital signature URL: From cfeist at redhat.com Fri Apr 1 18:05:14 2005 From: cfeist at redhat.com (Chris Feist) Date: Fri, 01 Apr 2005 12:05:14 -0600 Subject: [Linux-cluster] Re: Kernel RPMS In-Reply-To: <20050331103719.ogbpvxyzcc8g0s4c@imp.rexursive.com> References: <64370.213.164.3.90.1111585743.squirrel@www.nodata.co.uk> <42487E5B.3000000@redhat.com> <20050331103719.ogbpvxyzcc8g0s4c@imp.rexursive.com> Message-ID: <424D8D5A.3070103@redhat.com> Bojan Smojver wrote: > So, the source kernels available for the current version of RHEL4 (i.e. > kernel-2.6.9-5.0.3.EL) should be used as a base version for patching > with the > above cluster stuff? I'm guessing the patches will then bring that > kernel (i.e. > the current shipping one) in line with the kernel all those (no longer > available) RPMS depended on? Or do we have to use the latest vanilla > kernels > from kernel.org? Or it doesn't really matter because all 2.6.x kernels > are OK? I believe that the HEAD is targetted to the latest vanilla kernels, that's what you'll want to build against. Thanks, Chris From bojan at rexursive.com Fri Apr 1 21:46:03 2005 From: bojan at rexursive.com (Bojan Smojver) Date: Sat, 02 Apr 2005 07:46:03 +1000 Subject: [Linux-cluster] Re: Kernel RPMS In-Reply-To: <424D8D5A.3070103@redhat.com> References: <64370.213.164.3.90.1111585743.squirrel@www.nodata.co.uk> <42487E5B.3000000@redhat.com> <20050331103719.ogbpvxyzcc8g0s4c@imp.rexursive.com> <424D8D5A.3070103@redhat.com> Message-ID: <1112391963.4676.0.camel@beast.rexursive.com> On Fri, 2005-04-01 at 12:05 -0600, Chris Feist wrote: > I believe that the HEAD is targetted to the latest vanilla kernels, that's > what you'll want to build against. OK, get it. That's why I probably get LOCK_USE_CLNT issues with all the tarballs... -- Bojan From lhh at redhat.com Sat Apr 2 00:04:19 2005 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 01 Apr 2005 19:04:19 -0500 Subject: [Linux-cluster] Need advice on cluster configuration In-Reply-To: <424C5C72.4010403@digitaldan.com> References: <424C5C72.4010403@digitaldan.com> Message-ID: <1112400259.24872.115.camel@ayanami.boston.redhat.com> On Thu, 2005-03-31 at 13:24 -0700, Daniel Cunningham wrote: > 1.whats the relationship between the raw devices used in the cluster > software (which can share raw networked storage w/out GFS?) and when > your using GFS on top of it (or are the two unrelated)? The two are more or less unrelated. Raw devices in clumanager are used for storing internal clumanager states, and have a minimum size of 10mb each. Components of GFS (CCA, pool volumes, file systems, etc.) can not be used atop of those two raw devices, but may be used atop of the same GNBD volume (with partitioning, of course!). > 2. rgmanager, how is this different from the cluster software's > fallback (failback?) domains and members taking over a service and a > related floating ip from a fallen member? > again thanks for your time rgmanager ~= clumanager+1 Rgmanager is similar to clumanager, but is a bit more modular: it supports on-the-fly reconfigurations of services and uses CMAN+DLM or gulm for the infrastructure instead of providing its own. 
-- Lon From pshearer at lumbermens.net Wed Apr 6 00:35:01 2005 From: pshearer at lumbermens.net (Peter Shearer) Date: Tue, 5 Apr 2005 17:35:01 -0700 Subject: [Linux-cluster] LOCK_DLM Performance under Fire Message-ID: <75FE40F00B17B344A490CAAEB6F2217F01A26E@lbcmail1.lumbermens.net> Hi, Everyone -- I've been playing around with RHEL 4 and GFS from the tar files (not CVS) on three OptiPlex GX280 workstations using hyperthreading, SATA drives, and GNBD for sharing over a 1Gb network (dual NICs per machine). I'm exploring moving a legacy file-based COBOL application/database over to Linux on a bunch of smaller boxes vs its current home of a quad proc AIX machine. I have a test application which basically does applies a bunch of file and record locks on and within files along with some processor intense sorting algorithms to stress test the power of the solution. I'm running into some serious performance discrepancies of which I hope someone can help me make sense. Here's what I'm running into when I test this app on different file systems: ext3 on local disk, the test app takes about 3 min 20 sec to complete. ext3 on GNBD exported disk (one node only, obviously); completes in about 3 min 35 sec. GFS on GNBD mounted with the localflocks option; completes in 5 min 30 sec. GFS on GNBD mounted using LOCK_DLM with only one server mounting the fs; completes in 50 min 45 sec. GFS on GNBD mounted using LOCK_DLM with two servers mounting the fs; went over 80 min and wasn't even half done. GFS on GNBD mounted using LOCK_GULM...don't want to go there; I left it running for over 2 hrs and it was worse off than the two servers using LOCK_DLM. :) The test app mostly does a whole lot of file & record level locking -- not a lot of file transfer from the source disk to the memory of the local server. iostat on the client and server both show that the transfer rate of data on and off the hard disk is only at about 300kBs. top shows that the cpu on the client is being beat up as the dlm_astd, lock_dlm1, and lock_dlm2 are taking on average 50% - 60% of the proc (30%, 15%, 15%) and my test app is taking up the rest. When it's running on ext3 or GFS mounted with localflocks, there isn't this problem at all -- the test app goes to 99% of cpu; hence the faster completion times. I have isolated the data paths so that the GNBD data is running over one NIC and the rest of the cluster data is on the second NIC in these computers. Anyone have some ideas on how to tune this? Would exporting the GNBD file system with caching enabled help as I'm not using multiple GNBD servers, just multiple GNBD clients? Other options? Am I just way off base here? Thanks! ________________________________________ Peter Shearer A+, MCSE, MCSE: Security, CCNA IT Network Engineer Lumbermens -------------- next part -------------- An HTML attachment was scrubbed... URL: From teigland at redhat.com Wed Apr 6 02:53:21 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 6 Apr 2005 10:53:21 +0800 Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device identifier In-Reply-To: <20050401160703.GB22564@wavehammer.waldi.eu.org> References: <20050401142031.GA21976@wavehammer.waldi.eu.org> <20050401160703.GB22564@wavehammer.waldi.eu.org> Message-ID: <20050406025321.GB6415@redhat.com> On Fri, Apr 01, 2005 at 06:07:03PM +0200, Bastian Blank wrote: > Updated patch. > > Changes: > - Use mmap. > - Add the device size to the exported information. Is libmagic standard enough to use instead of iddev? 
If not, then what about http://cvs.freedesktop.org/hal/hal/volume_id/ ? Then we could just get rid of iddev. -- Dave Teigland From teigland at redhat.com Wed Apr 6 03:47:39 2005 From: teigland at redhat.com (David Teigland) Date: Wed, 6 Apr 2005 11:47:39 +0800 Subject: [Linux-cluster] LOCK_DLM Performance under Fire In-Reply-To: <75FE40F00B17B344A490CAAEB6F2217F01A26E@lbcmail1.lumbermens.net> References: <75FE40F00B17B344A490CAAEB6F2217F01A26E@lbcmail1.lumbermens.net> Message-ID: <20050406034739.GC6415@redhat.com> On Tue, Apr 05, 2005 at 05:35:01PM -0700, Peter Shearer wrote: > ext3 on local disk, the test app takes about 3 min 20 sec to complete. > ext3 on GNBD exported disk (one node only, obviously); completes in > about 3 min 35 sec. > GFS on GNBD mounted with the localflocks option; completes in 5 min 30 > sec. > GFS on GNBD mounted using LOCK_DLM with only one server mounting the fs; > completes in 50 min 45 sec. > GFS on GNBD mounted using LOCK_DLM with two servers mounting the fs; > went over 80 min and wasn't even half done. It sounds like the app is using fcntl (posix) locks, not flock(2)? If so, that's a weak spot for lock_dlm which translates posix-lock requests into multiple dlm lock operations. That said, it's possible the code may be doing some dumb things that could be fixed to improve the speed. If there are hundreds of files being locked, one simple thing to try is to increase SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h (sorry, never made them tunable through proc.) This relates to some basic caching lock_dlm does for files that are repeatedly locked/unlocked. If the app could get by with just using flock() that would certainly be much faster. Also, if you could provide the test you use or a simplified equivalent it would help. -- Dave Teigland From bastian at waldi.eu.org Wed Apr 6 07:53:13 2005 From: bastian at waldi.eu.org (Bastian Blank) Date: Wed, 6 Apr 2005 09:53:13 +0200 Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device identifier In-Reply-To: <20050406025321.GB6415@redhat.com> References: <20050401142031.GA21976@wavehammer.waldi.eu.org> <20050401160703.GB22564@wavehammer.waldi.eu.org> <20050406025321.GB6415@redhat.com> Message-ID: <20050406075313.GA6054@wavehammer.waldi.eu.org> On Wed, Apr 06, 2005 at 10:53:21AM +0800, David Teigland wrote: > Is libmagic standard enough to use instead of iddev? It returns a string, maybe a little bit too loose. > If not, then > what about http://cvs.freedesktop.org/hal/hal/volume_id/ ? Returns enough information for the gfs part, name of the filesystem and uuid, not enough for what I want to use it (block size, filesystem size). Bastian -- Worlds are conquered, galaxies destroyed -- but a woman is always a woman. -- Kirk, "The Conscience of the King", stardate 2818.9 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: Digital signature URL: From pshearer at lumbermens.net Wed Apr 6 19:01:02 2005 From: pshearer at lumbermens.net (Peter Shearer) Date: Wed, 6 Apr 2005 12:01:02 -0700 Subject: [Linux-cluster] LOCK_DLM Performance under Fire Message-ID: <75FE40F00B17B344A490CAAEB6F2217F01A26F@lbcmail1.lumbermens.net> Ick...it appears the apps's locking mechanism is fnctl. An strace off the app is full of... 
fcntl64(8, F_SETLK64, {type=F_UNLCK, whence=SEEK_SET, start=2147478526, len=1024}, 0xbffff5a0) = 0 fcntl64(8, F_SETLK64, {type=F_WRLCK, whence=SEEK_SET, start=2147477478, len=1}, 0xbffff4f0) = 0 ...type messages. The app itself is a really old COBOL app built on Liant's RM/Cobol -- an abstraction software similar to java which allows the same object code to run on Linux, UNIX, and Windows with very little modification through a runtime application. So, while I have access to the source for the compiled object, I don't have access to the runtime app code, which is really the thing doing all the locking. This specific testing app is opening one file with locks, but it's beating that file up. Essentially, it's going through the file and performing a series of sorts and searches, which, for the most part, would beat up the proc more than the I/O. The "real" application for the most part will not be nearly as intense, but will open probably around 100 shared files simultaneously with posix locking. Would adjusting the SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h affect this type of application? Any other tunable parameters which will help out? I'm not tied to DLM at this point...is there another mechanism which would do this equally well? As for a test app...I'm not sure I'll be able to provide that. I'll look into it, though. --Peter -----Original Message----- From: David Teigland [mailto:teigland at redhat.com] Sent: Tuesday, April 05, 2005 8:48 PM To: Peter Shearer Cc: linux-cluster at redhat.com Subject: Re: [Linux-cluster] LOCK_DLM Performance under Fire On Tue, Apr 05, 2005 at 05:35:01PM -0700, Peter Shearer wrote: > ext3 on local disk, the test app takes about 3 min 20 sec to complete. > ext3 on GNBD exported disk (one node only, obviously); completes in > about 3 min 35 sec. > GFS on GNBD mounted with the localflocks option; completes in 5 min 30 > sec. > GFS on GNBD mounted using LOCK_DLM with only one server mounting the fs; > completes in 50 min 45 sec. > GFS on GNBD mounted using LOCK_DLM with two servers mounting the fs; > went over 80 min and wasn't even half done. It sounds like the app is using fcntl (posix) locks, not flock(2)? If so, that's a weak spot for lock_dlm which translates posix-lock requests into multiple dlm lock operations. That said, it's possible the code may be doing some dumb things that could be fixed to improve the speed. If there are hundreds of files being locked, one simple thing to try is to increase SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h (sorry, never made them tunable through proc.) This relates to some basic caching lock_dlm does for files that are repeatedly locked/unlocked. If the app could get by with just using flock() that would certainly be much faster. Also, if you could provide the test you use or a simplified equivalent it would help. 
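To make the difference concrete, here is a minimal, self-contained C sketch (just an illustration, not code from GFS or from the COBOL runtime discussed above; the file path is made up) of the two calls being compared in this thread: the fcntl() byte-range locks visible in the strace, and the flock() whole-file lock that maps onto far fewer distributed lock operations:

/* lockdemo.c - illustrative sketch only (not from this thread).
 * Contrasts fcntl() POSIX byte-range locks, which lock_dlm has to
 * translate into several DLM operations, with flock(), which maps
 * onto a single whole-file lock.  The path below is just an example. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/gfs/testfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        /* POSIX byte-range lock, like the F_SETLK64 calls in the strace:
         * lock one 1024-byte record at offset 0, then drop it. */
        struct flock fl = {
                .l_type   = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 1024,
        };
        if (fcntl(fd, F_SETLKW, &fl) < 0)
                perror("fcntl F_SETLKW");
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);

        /* BSD-style whole-file lock: one lock per file, so far fewer
         * distributed lock operations for lock_dlm to perform. */
        if (flock(fd, LOCK_EX) < 0)
                perror("flock LOCK_EX");
        flock(fd, LOCK_UN);

        close(fd);
        return 0;
}

Built with plain gcc and pointed at a GFS mount, something along these lines exercises the lock_dlm posix-lock path in the first case and the single flock path in the second, which is the gap the timings above reflect.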
-- Dave Teigland From teigland at redhat.com Thu Apr 7 02:30:37 2005 From: teigland at redhat.com (David Teigland) Date: Thu, 7 Apr 2005 10:30:37 +0800 Subject: [Linux-cluster] LOCK_DLM Performance under Fire In-Reply-To: <75FE40F00B17B344A490CAAEB6F2217F01A26F@lbcmail1.lumbermens.net> References: <75FE40F00B17B344A490CAAEB6F2217F01A26F@lbcmail1.lumbermens.net> Message-ID: <20050407023037.GA6615@redhat.com> On Wed, Apr 06, 2005 at 12:01:02PM -0700, Peter Shearer wrote: > The app itself is a really old COBOL app built on Liant's RM/Cobol -- an > abstraction software similar to java which allows the same object code > to run on Linux, UNIX, and Windows with very little modification through > a runtime application. So, while I have access to the source for the > compiled object, I don't have access to the runtime app code, which is > really the thing doing all the locking. > > This specific testing app is opening one file with locks, but it's > beating that file up. Essentially, it's going through the file and > performing a series of sorts and searches, which, for the most part, > would beat up the proc more than the I/O. The "real" application for > the most part will not be nearly as intense, but will open probably > around 100 shared files simultaneously with posix locking. Would > adjusting the SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h > affect this type of application? Any other tunable parameters which > will help out? I'm not tied to DLM at this point...is there another > mechanism which would do this equally well? Taking a step back, is this a parallelized/clusterized application? i.e. will it be running concurrently on different machines with the data shared using GFS? If so, then the distributed fcntl locks are critical. If not, it would be safe to use the localflocks mount option which means fcntl locks are no longer translated to distributed locks. -- Dave Teigland From serge at triumvirat.ru Thu Apr 7 12:02:32 2005 From: serge at triumvirat.ru (Sergey) Date: Thu, 7 Apr 2005 16:02:32 +0400 Subject: [Linux-cluster] problems with gfs locking Message-ID: <341716745.20050407160232@triumvirat.ru> Hello everybody! Please, someone help me with a huge problem. We have two servers HP DL380G4 connected to HP MSA500 (Modular Smart Array with currently installed 3 disks as RAID-5 summary volume of 274G). Servers works under Red Hat Enterprise Linux, data storage is formatted to GFS. Two months system with 2 nodes works fine. But two weeks ago we started experiencing problems with system load. Symptoms are as follows: 1. Server on which httpd is running become unstable because of increasing of simultaneously running processes - uptime shows numbers 10, 20,..., 120, 160 in few minutes, top hangs after this number is big enough. If run ps to see httpd processes, all of them will be with status D (uninterruptible sleep) - so Apache runs MaxClients processes every of them never ends. I can't kill none of them and they are locked with high probability by GFS - there are two processes gulm_Cb_Handler both taking about 100% of CPU usage. 2. Apache server-status shows that almost every process hangs with status W (sending reply), MySQL shows that lot of connections are open (each script in auto-prepend file opens connection) but they are sleeping. Apache document_root points to GFS raid, so every http-request causes filesystem to read or write files (users activity was about 8 Gb in 10000 files in last month, which is twice as much in previous month, when system seemed stable). 
Now filesystem is used at 15% (about 40Gb of 274Gb), the biggest folder contains over 30000 files - may be this is the reason of problems, like when quantity turns into (low) quality. 3. Another reason which caused locking of filesystem is cvs, which goes over all of that thousands of files. But this can not be repeated - only few times cvs hanged while updating (in fact, checking) some folders (not very big sometimes). 4. Traffic diagram (by MRTG) shows that when GFS going down there are suspicious spikes of activity on network interface which is used to link GFS nodes raising up to 4 Mbits/sec (while average throughput is about 100 kbits/sec) in both sides. We assume that our problems started when we changed link between two nodes from plain patch cord to Cisco Catalyst switch (which may have only 10 Mbits/sec througput). Can slow network be the reason of our troubles? And another question - does journals synchronizes or is there any other activity between two nodes while reading data from GFS on one of them? Thanks for any qualified answers. -- Sergey From pshearer at lumbermens.net Thu Apr 7 16:40:34 2005 From: pshearer at lumbermens.net (Peter Shearer) Date: Thu, 7 Apr 2005 09:40:34 -0700 Subject: [Linux-cluster] LOCK_DLM Performance under Fire Message-ID: <75FE40F00B17B344A490CAAEB6F2217F2159AE@lbcmail1.lumbermens.net> Yes, the idea was to parallelize the app across multiple machines sharing a common SAN infrastructure (hopefully iSCSI; if not, then GNBD in the interim). There is no central control daemon or database manager; each instance of the app does its own record locking and such, so it really doesn't matter where the data resides, as long as all the clients are able to touch the same files. Therefore, distributed locks are really important. I had suspected that the locking subsys was causing the slowdowns, so that's why I did a test with the localflocks -- it's not as fast as ext3, but works fine with only one server involved. Of course, that's not going to work for this application. :) --Peter -----Original Message----- From: David Teigland [mailto:teigland at redhat.com] Sent: Wednesday, April 06, 2005 7:31 PM To: Peter Shearer Cc: linux-cluster at redhat.com Subject: Re: [Linux-cluster] LOCK_DLM Performance under Fire On Wed, Apr 06, 2005 at 12:01:02PM -0700, Peter Shearer wrote: > The app itself is a really old COBOL app built on Liant's RM/Cobol -- an > abstraction software similar to java which allows the same object code > to run on Linux, UNIX, and Windows with very little modification through > a runtime application. So, while I have access to the source for the > compiled object, I don't have access to the runtime app code, which is > really the thing doing all the locking. > > This specific testing app is opening one file with locks, but it's > beating that file up. Essentially, it's going through the file and > performing a series of sorts and searches, which, for the most part, > would beat up the proc more than the I/O. The "real" application for > the most part will not be nearly as intense, but will open probably > around 100 shared files simultaneously with posix locking. Would > adjusting the SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h > affect this type of application? Any other tunable parameters which > will help out? I'm not tied to DLM at this point...is there another > mechanism which would do this equally well? Taking a step back, is this a parallelized/clusterized application? i.e. 
will it be running concurrently on different machines with the data shared using GFS? If so, then the distributed fcntl locks are critical. If not, it would be safe to use the localflocks mount option which means fcntl locks are no longer translated to distributed locks. -- Dave Teigland From daniel at osdl.org Tue Apr 12 00:13:06 2005 From: daniel at osdl.org (Daniel McNeil) Date: Mon, 11 Apr 2005 17:13:06 -0700 Subject: [Linux-cluster] test hung after 36 hours Message-ID: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net> I started my mount/tar/rm/ tests on Apr 4 17:41 and I hit a problem at Apr 6 05:30. So the test ran for 36 hours. cl030 and cl031 were getting "SM: process_reply invalid" messages and cl032 got "No response" and "Missed too many heartbeats" cl032: [-- MARK -- Wed Apr 6 05:15:00 2005] CMAN: removing node cl030a from the cluster : Missed too many heartbeats CMAN: removing node cl031a from the cluster : No response to messages CMAN: quorum lost, blocking activity [-- MARK -- Wed Apr 6 05:30:00 2005] GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs" cl030: [-- MARK -- Wed Apr 6 05:15:00 2005] CMAN: removing node cl032a from the cluster : Missed too many heartbeats GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs" GFS: fsid=gfs_cluster:stripefs.0: Joined cluster. Now mounting FS... GFS: fsid=gfs_cluster:stripefs.0: jid=0: Trying to acquire journal lock... GFS: fsid=gfs_cluster:stripefs.0: jid=0: Looking at journal... GFS: fsid=gfs_cluster:stripefs.0: jid=0: Done GFS: fsid=gfs_cluster:stripefs.0: jid=1: Trying to acquire journal lock... GFS: fsid=gfs_cluster:stripefs.0: jid=1: Looking at journal... GFS: fsid=gfs_cluster:stripefs.0: jid=1: Done GFS: fsid=gfs_cluster:stripefs.0: jid=2: Trying to acquire journal lock... GFS: fsid=gfs_cluster:stripefs.0: jid=2: Looking at journal... GFS: fsid=gfs_cluster:stripefs.0: jid=2: Done GFS: fsid=gfs_cluster:stripefs.0: jid=3: Trying to acquire journal lock... GFS: fsid=gfs_cluster:stripefs.0: jid=3: Looking at journal... GFS: fsid=gfs_cluster:stripefs.0: jid=3: Done SM: process_reply invalid id=20496 nodeid=4294967295 SM: process_reply invalid id=20497 nodeid=4294967295 cl031: [-- MARK -- Wed Apr 6 05:15:00 2005] SM: process_reply invalid id=20496 nodeid=4294967295 SM: process_reply invalid id=20496 nodeid=4294967295 SM: process_reply invalid id=20496 nodeid=4294967295 SM: process_reply invalid id=20497 nodeid=4294967295 SM: process_reply invalid id=20497 nodeid=4294967295 SM: process_reply invalid id=20497 nodeid=4294967295 SM: process_reply invalid id=20500 nodeid=4294967295 SM: process_reply invalid id=20500 nodeid=4294967295 SM: process_reply invalid id=20500 nodeid=4294967295 SM: process_reply invalid id=20501 nodeid=4294967295 SM: process_reply invalid id=20501 nodeid=4294967295 SM: process_reply invalid id=20501 nodeid=4294967295 SM: process_reply invalid id=20504 nodeid=4294967295 SM: process_reply invalid id=20504 nodeid=4294967295 SM: process_reply invalid id=20504 nodeid=4294967295 GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs" SM: process_reply invalid id=20505 nodeid=4294967295 GFS: fsid=gfs_cluster:stripefs.1: Joined cluster. Now mounting FS... A bit more info is available here. http://developer.osdl.org/daniel/GFS/test.04apr2005/ Any ideas on what is going on? 
Daniel From teigland at redhat.com Tue Apr 12 03:30:26 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 12 Apr 2005 11:30:26 +0800 Subject: [Linux-cluster] test hung after 36 hours In-Reply-To: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net> References: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net> Message-ID: <20050412033026.GB7350@redhat.com> On Mon, Apr 11, 2005 at 05:13:06PM -0700, Daniel McNeil wrote: > I started my mount/tar/rm/ tests on Apr 4 17:41 and I hit > a problem at Apr 6 05:30. So the test ran for 36 hours. > cl030 and cl031 were getting "SM: process_reply invalid" > messages and cl032 got "No response" and "Missed too many > heartbeats" The SM messages are an effect of CMAN removing nodes. There's a fair chance that this recent fix will help: http://sources.redhat.com/ml/cluster-cvs/2005-q2/msg00018.html -- Dave Teigland From Hansjoerg.Maurer at dlr.de Wed Apr 13 06:32:24 2005 From: Hansjoerg.Maurer at dlr.de (Hansjoerg Maurer) Date: Wed, 13 Apr 2005 08:32:24 +0200 Subject: [Linux-cluster] Iozone tests on gfs with 29th march gfs snapshot Message-ID: <425CBCF8.3090908@dlr.de> Hi, we are planning a small cluster (3 nodes in the SAN and 13 nodes with gnbd) and I have tried gfs 6.1 from 29th march on RHEL testkernel 2.6.9-6.36.ELsmp x64 (elevator=deadline) Opteron x64 Hardware (we want to use RHEL4 based systems, because our cluster application performes much better with RHEL4 gcc on the opteron Hardware compared to RHEL3 gcc) Installation was fine :-) We have done some iozone runs on a local disk (SAN hardware is not avaliable yet) with - gfs and lock_dlm - gfs mounted with localcache - ext3 - gnbd (test on remote host) - nfs (test on remote host) The avaliable memory of the computers was reduced to 1G to speed up the test. There are some interesting points: - gfs seems to perform nearly as good as ext3 only with reclen 1024 and during write - gfs read performance seems to be not very good (are there any flags to improve it?) - there seems to be no big difference in mount-option with localcache and using lock_dlm - gnbd's write performance seems to be better as nfs - nfs read performance seems to be better as gnbd's - running two gnbd tests in parallel reduces read performance dramatically (may be an hardware issue, because it seems to be the same with nfs) We want to use he cluster filesystem mostly to read and proceed 2 GB datasets. So we will test it with an application soon, which ist not a synthetic benchmark. Is there any prefered elevator, one should use with gfs? (I will try some tests this evening) It would be nice, if anyone with gfs experience could comment on the results. We will have the hardware avaliable for testing until the mid of next week, though if someone wants me to try some other configurations (including current CVS) give me a note. Thank you very much Hansj?rg -- Ext3 Run began: Tue Apr 12 20:55:54 2005 Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. Auto Mode Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 1 -i 2 Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. 
random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 2097152 64 56657 55876 38324 37653 11437 20542 2097152 128 59898 56835 38341 38075 19130 33323 2097152 256 61116 57787 35269 38075 15809 43095 2097152 512 61501 58274 37937 38266 23262 39051 2097152 1024 58998 53493 37869 35665 37163 38920 2097152 2048 58333 58310 35291 38273 47074 38147 2097152 4096 60143 60270 37873 38445 54552 37152 2097152 8192 57725 50131 37767 38112 54441 37164 2097152 16384 53178 57389 38317 38283 58545 34425 iozone test complete. -- GFS lock_dlm Run began: Tue Apr 12 19:16:51 2005 Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. Auto Mode Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 1 -i 2 Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 2097152 64 37142 37320 24185 25787 7972 22459 2097152 128 34475 34459 25381 25408 13556 21645 2097152 256 33592 33405 25220 25222 11011 21268 2097152 512 32269 31527 25081 25308 14364 23356 2097152 1024 57865 35774 24190 24919 23027 28430 2097152 2048 41923 34880 25364 24216 29515 31278 2097152 4096 46328 35647 24187 25127 32752 30084 2097152 8192 32296 34830 25172 24000 35714 31385 2097152 16384 40136 38008 24386 25317 39407 37255 iozone test complete. -- GFS mounted with localcache Run began: Tue Apr 12 19:17:53 2005 Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. Auto Mode Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 1 -i 2 Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 2097152 64 35995 35988 24151 26080 8119 21277 2097152 128 31582 35475 25624 25818 13604 22245 2097152 256 30484 34576 25344 25073 11117 21686 2097152 512 29355 35323 25542 25107 14942 24594 2097152 1024 33213 32260 24896 25344 22427 26454 2097152 2048 34852 36949 24417 25377 30143 31454 2097152 4096 42722 33431 24978 24774 32416 31650 2097152 8192 43942 32786 25752 24461 36606 33237 2097152 16384 32568 33575 25072 25506 38057 32971 - NFS mounted ext3 Run began: Tue Apr 12 22:08:04 2005 Include close in write timing Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. Auto Mode Command line used: /opt/iozone/bin/iozone -c -n 2G -g 2G -a -i 0 -i 1 -i 2 Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. 
random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 2097152 64 29933 24934 34428 34329 11396 5095 2097152 128 29848 24387 33693 33815 18280 5822 2097152 256 30033 25856 33915 33520 24666 7303 2097152 512 29868 26192 33932 32201 16207 7797 2097152 1024 30857 25485 31165 32378 27212 9730 2097152 2048 28617 24478 33258 35049 41341 10497 2097152 4096 29514 25804 33221 31459 49188 9050 2097152 8192 30777 25721 32443 32264 46874 8502 2097152 16384 28470 25419 34607 34286 57369 8056 iozone test complete. - GNBD mounted GFS Run began: Tue Apr 12 19:16:51 2005 Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. Auto Mode Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 1 -i 2 Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 2097152 64 37142 37320 24185 25787 7972 22459 2097152 128 34475 34459 25381 25408 13556 21645 2097152 256 33592 33405 25220 25222 11011 21268 2097152 512 32269 31527 25081 25308 14364 23356 2097152 1024 57865 35774 24190 24919 23027 28430 2097152 2048 41923 34880 25364 24216 29515 31278 2097152 4096 46328 35647 24187 25127 32752 30084 2097152 8192 32296 34830 25172 24000 35714 31385 2097152 16384 40136 38008 24386 25317 39407 37255 iozone test complete. --GNBD mounted GFS (2 simultanous runs und 2 gnbd clients) Run began: Tue Apr 12 22:33:15 2005 Using minimum file size of 2097152 kilobytes. Using maximum file size of 2097152 kilobytes. Auto Mode Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 1 -i 2 Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 2097152 64 21310 20580 9275 10662 3472 10022 2097152 128 11953 17971 8171 10598 4633 17555 2097152 256 15856 22844 5131 11313 4157 25773 2097152 512 15887 28725 4182 12117 7046 27678 2097152 1024 28444 26170 3780 10233 9771 30745 2097152 2048 38449 27667 4042 10059 12872 30426 2097152 4096 27096 35431 4699 11047 15139 31802 2097152 8192 29305 37275 4696 9728 14856 39259 2097152 16384 26824 49085 8836 5279 16865 57405 iozone test complete. -- _________________________________________________________________ Dr. Hansjoerg Maurer | LAN- & System-Manager | Deutsches Zentrum | DLR Oberpfaffenhofen f. Luft- und Raumfahrt e.V. | Institut f. Robotik | Postfach 1116 | Muenchner Strasse 20 82230 Wessling | 82234 Wessling Germany | | Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ __________________________________________________________________ There are 10 types of people in this world, those who understand binary and those who don't. From birger at birger.sh Wed Apr 13 12:18:52 2005 From: birger at birger.sh (birger) Date: Wed, 13 Apr 2005 14:18:52 +0200 Subject: [Linux-cluster] Problems compiling cluster software on fedora core 3 Message-ID: <425D0E2C.20102@birger.sh> I fetched the cluster sources using cvs today. 
I have tried compiling them on Fedora Core 3 I used ./configure --kernel=/lib/modules/2.6.11-1.14_FC3/build to compile without installing full kernel source. First I had to edit cluster/cman/lib/libcman.c and change #include into #include Then I ran into problems with cmirror wanting dm-log.h and dm-io.h. I found these in the device-manager source, but compilation then fails with syntax errors. Can someone give some advice on how to compile and install this? -- birger From jbrassow at redhat.com Wed Apr 13 14:01:11 2005 From: jbrassow at redhat.com (Jonathan E Brassow) Date: Wed, 13 Apr 2005 09:01:11 -0500 Subject: [Linux-cluster] Problems compiling cluster software on fedora core 3 In-Reply-To: <425D0E2C.20102@birger.sh> References: <425D0E2C.20102@birger.sh> Message-ID: On Apr 13, 2005, at 7:18 AM, birger wrote: > Then I ran into problems with cmirror wanting dm-log.h and dm-io.h. I > found these in the device-manager source, but compilation then fails > with syntax errors. I shouldn't even be compiling this from the top level. It's not ready and there needs to be accompanying device-mapper changes. Please forgive. I think cmirror is the last thing to compile, so if you ignore the error the rest should have installed fine. If you don't like the errors, you can comment out cmirror in the makefile - which is what I'm going to do right now. brassow From hansjoerg.maurer at dlr.de Wed Apr 13 18:35:40 2005 From: hansjoerg.maurer at dlr.de (=?ISO-8859-1?Q?Hansj=F6rg_Maurer?=) Date: Wed, 13 Apr 2005 20:35:40 +0200 Subject: [Linux-cluster] GNBD multipath with devicemapper? Message-ID: <425D667C.2050701@dlr.de> Hi I am trying to set up gnbd with multipath. Accoding to the gnbd_usage.txt file, I understand, that this should work with dm-multipath. But unfortunatly only the gfs part of the setup is descriped there. Has anybody experiance with this setup, especially how to set up multipath with multiple /dev/gnbd* and how to setup the multipath.conf file Thank you very much Hansj?rg Maurer From daniel at osdl.org Wed Apr 13 21:56:08 2005 From: daniel at osdl.org (Daniel McNeil) Date: Wed, 13 Apr 2005 14:56:08 -0700 Subject: [Linux-cluster] oops after 12 hours during umount In-Reply-To: <20050412033026.GB7350@redhat.com> References: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net> <20050412033026.GB7350@redhat.com> Message-ID: <1113429368.31312.39.camel@ibm-c.pdx.osdl.net> On Mon, 2005-04-11 at 20:30, David Teigland wrote: > On Mon, Apr 11, 2005 at 05:13:06PM -0700, Daniel McNeil wrote: > > I started my mount/tar/rm/ tests on Apr 4 17:41 and I hit > > a problem at Apr 6 05:30. So the test ran for 36 hours. > > cl030 and cl031 were getting "SM: process_reply invalid" > > messages and cl032 got "No response" and "Missed too many > > heartbeats" > > The SM messages are an effect of CMAN removing nodes. There's a fair > chance that this recent fix will help: > http://sources.redhat.com/ml/cluster-cvs/2005-q2/msg00018.html Good news and bad news. Good news: I think my previous problem was an network upgrade that accidentally cut off one of my nodes. Bad news: after upgrading to the latest cvs I hit an oops after 12 hours. The below looks life we are accessing freed memory. I have slab debug and spin lock debug configured. 
Here's the oops: Unable to handle kernel paging request at virtual address 6b6b6bbf printing eip: c03e8682 *pde = 00000000 Oops: 0002 [#1] PREEMPT SMP Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod video CPU: 0 EIP: 0060:[] Not tainted VLI EFLAGS: 00010246 (2.6.11) EIP is at _spin_lock+0x22/0x90 eax: 00000000 ebx: 6b6b6bbf ecx: 00000001 edx: cdc82000 esi: cdc82000 edi: 6b6b6bbf ebp: cdc82ea4 esp: cdc82e9c ds: 007b es: 007b ss: 0068 Process umount (pid: 14022, threadinfo=cdc82000 task=cc113a60) Stack: d2bee958 d2beea7c cdc82ebc c0162f06 d2bee958 d2bee968 d2bee958 6b6b6b6b cdc82edc c017bb24 d2bee958 00004192 00000001 cdc82eec ce844050 f90314e0 cdc82efc c017bc14 cbd665d0 cdc82eec d2bee4ec cbe47b3c cbd66544 ce844050 Call Trace: [] show_stack+0x7f/0xa0 [] show_registers+0x162/0x1e0 [] die+0xfe/0x190 [] do_page_fault+0x3b2/0x6f2 [] error_code+0x2b/0x30 [] invalidate_inode_buffers+0x46/0x90 [] invalidate_list+0x44/0xe0 [] invalidate_inodes+0x54/0x90 [] generic_shutdown_super+0x74/0x140 [] gfs_kill_sb+0x2e/0x69 [gfs] [] deactivate_super+0x81/0xa0 [] sys_umount+0x3c/0xa0 [] sys_oldumount+0x19/0x20 [] sysenter_past_esp+0x52/0x75 Code: 00 00 00 8d bf 00 00 00 00 55 89 e5 83 ec 08 89 1c 24 89 c3 b8 01 00 00 00 89 74 24 04 e8 47 06 d3 ff be 00 f0 ff ff 21 e6 31 c0 <86> 03 84 c0 7e 0b 8b 1c 24 8b 74 24 04 89 ec 5d c3 b8 01 00 00 Daniel From birger at birger.sh Thu Apr 14 07:53:36 2005 From: birger at birger.sh (birger) Date: Thu, 14 Apr 2005 09:53:36 +0200 Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs Message-ID: <425E2180.6060609@birger.sh> I think I sent this with a wrong sender address, as I didn't see it apear on the list. I have solved the problem with starting ccsd in my original message (if it ever went out) I have just installed the cluster package, and I am now looking for some help on how to use it :-) I have a lot of experience with Veritas FirstWatch and some with SunCluster, so I am not new to HA services. Now I have this server that I have to get up and running as quickly as possible... And I find little documentation about how to get this software up and running from a fresh install. I have one old file server with external scsi disks, and one new server with a SCSI-attached Nexsan ATAboy RAID array. I want to set up the new file server as half of a 2-node cluster and get it into production. Then move over data (and disks) from the old server until I can reinstall that one as the second cluster node. I first thought along the lines I am used to from Solaris and clustering. I wanted to set up 2 services: 1 NFS service that would take the disk with it when it moved, and 1 samba service that NFS-mounted the disk from the nfs service. After looking at the redhat stuff I am thinking: - Mount the disks permanently on both nodes using gfs (less chance of nuking the file systems because of a split-brain) - Perhaps also run NFS services permanently on both nodes, failing over only the IP address of the official NFS service. Should make failover even faster, but are there pitfalls to running multiple NFS servers off the same gfs file system? In addition to failing over the IP address, I would have to look into how to take along NFS file locks when doing a takeover. - samba running as a service that fails over if a node goes down. Can anyone 'talk me through' the steps needed to get this up and running? First attempts at starting ccsd failed with Failed to connect to cluster manager. Hint: Magma plugins are not in the right spot. 
I fixed this by cd'ing down into magma in the cluster directory i fetched with cvs and doing make clean make make install When I did a make from top-level of the cvs sources magma got built with plugin dir pointing into the source directory. Just recompiling (without rerunning configure) fixed it. Something to look into for the maintainers? I now have my first gfs file system, but I get a permission denied when trying to mount it. How should I diagnose this? -- birger From birger at birger.sh Thu Apr 14 07:54:36 2005 From: birger at birger.sh (birger) Date: Thu, 14 Apr 2005 09:54:36 +0200 Subject: [Linux-cluster] Problems compiling cluster software on fedora core 3 In-Reply-To: References: <425D0E2C.20102@birger.sh> Message-ID: <425E21BC.9030000@birger.sh> Resending this as I may have sent it using wrong sender address. I never saw it appear... Jonathan E Brassow wrote: > > I shouldn't even be compiling this from the top level. It's not ready > and there needs to be accompanying device-mapper changes. That explains a lot :-D > > Please forgive. I think cmirror is the last thing to compile, so if > you ignore the error the rest should have installed fine. If you > don't like the errors, you can comment out cmirror in the makefile - > which is what I'm going to do right now. I did, and did a new make (that didn't have anything to do) and a make install. Thanks for your answer. It certainly solved my problems. From birger at uib.no Thu Apr 14 05:31:15 2005 From: birger at uib.no (Birger Wathne) Date: Thu, 14 Apr 2005 07:31:15 +0200 Subject: [Linux-cluster] Problems compiling cluster software on fedora core 3 In-Reply-To: References: <425D0E2C.20102@birger.sh> Message-ID: <425E0023.2030609@uib.no> Jonathan E Brassow wrote: > > I shouldn't even be compiling this from the top level. It's not ready > and there needs to be accompanying device-mapper changes. That explains a lot :-D > > Please forgive. I think cmirror is the last thing to compile, so if > you ignore the error the rest should have installed fine. If you > don't like the errors, you can comment out cmirror in the makefile - > which is what I'm going to do right now. I did, and did a new make (that didn't have anything to do) and a make install. Thanks for your answer. It certainly solved my problems. -- birger From birger at uib.no Thu Apr 14 05:52:30 2005 From: birger at uib.no (Birger Wathne) Date: Thu, 14 Apr 2005 07:52:30 +0200 Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs Message-ID: <425E051E.2050208@uib.no> I have just installed the cluster package, and I am now looking for some help on how to use it :-) I have a lot of experience with Veritas FirstWatch and some with SunCluster, so I am not new to HA services. Now I have this server that I have to get up and running as quickly as possible... And I find little documentation about how to get this software up and running from a fresh install. I have one old file server with external scsi disks, and one new server with a Nexsan ATAboy RAID array. I want to set up the new file server as half of a 2-node cluster and get it into production. Then move over data (and disks) from the old server until I can reinstall that one as the second cluster node. I first thought along the lines I am used to from Solaris and clustering. I wanted to set up 2 services: 1 NFS service that would take the disk with it when it moved, and 1 samba service that NFS-mounted the disk from the nfs service. 
After looking at the redhat stuff I am thinking: - Mount the disks permanently on both nodes using gfs (less chance of nuking the file systems because of a split-brain) - Perhaps also run NFS services permanently on both nodes, failing over only the IP address of the official NFS service. Should make failover even faster, but are there pitfalls to running multiple NFS servers off the same gfs file system? In addition to failing over the IP address, I would have to look into how to take along NFS file locks when doing a takeover. Can anyone 'talk me through' the steps needed to get this up and running? I have tried to create /etc/cluster/cluster.conf, but ccsd fails with Failed to connect to cluster manager. Hint: Magma plugins are not in the right spot. -- birger From fabbione at fabbione.net Thu Apr 14 14:44:44 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Thu, 14 Apr 2005 16:44:44 +0200 (CEST) Subject: [Linux-cluster] [PATCH] Fix cman-kernel build with 2.6.12rc2 Message-ID: <20050414144444.6ABE12A8C@trider-g7.fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi everybody, the following patch fixes compilation of cnxman.c with 2.6.12rc2 replacing sk_zapped with its new substitute. Please apply. Signed-off-by: Fabio Massimo Di Nitto Index: cman-kernel/src/cnxman.c =================================================================== RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v retrieving revision 1.55 diff -u -r1.55 cnxman.c - --- cman-kernel/src/cnxman.c 5 Apr 2005 13:43:09 -0000 1.55 +++ cman-kernel/src/cnxman.c 14 Apr 2005 14:27:03 -0000 @@ -1065,7 +1065,7 @@ if (!capable(CAP_NET_BIND_SERVICE)) return -EPERM; - - if (sk->sk_zapped == 0) + if (sock_flag(sk, SOCK_ZAPPED) == 0) return -EINVAL; if (addr_len != sizeof (struct sockaddr_cl)) @@ -1089,7 +1089,7 @@ up(&port_array_lock); c->port = saddr->scl_port; - - sk->sk_zapped = 0; + sock_reset_flag(sk, SOCK_ZAPPED); /* If we are not a cluster member yet then make the client wait until * we are, this allows nodes to start cluster clients at the same time -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) iQIVAwUBQl6BOlA6oBJjVJ+OAQIUPBAAtPSWMm5k9JvH+fMt+H193fJfgfEWa7EL jfwC2GP/fSAz7RmhXvChdqkrxSBt7VS5oEKwwG6kt7tfJZIvxIACaEzaOOXLidHh UyOWJHUj0twQZQYsBEr8nN7lcfC2+jCiYSgbCoUzD/+36q/QOFoLeeAOa0am+c7t lc4QRmBt1tfGPtf7dTVgNpwX/hZPSlLXLDaBkUBztvncTTZNmmQ15jmjufXcdj31 lGtYzc8we5540YtbWKmFG0/M6AOS0BCOBT+3vLxtwbdXIeO2c+1BOhFr3cWPYgAg 01q/TlMN504P7KyhOm7G/an/exrbzDDKflgzAadEuzDoFAFnDZG0FCdpaz/Fvd3j YkJFPqMJuX0DuhjiHJlwwvzkcjO32RWBAs06lYlCVjyK6mf4GasUF3dJPAIIOhWh ZHK7c0+dWyUA8GINjdJaCkKn3Yz/zmFxFLSUEZsl63A4AXlP7cc+Nz3+0VfmXbVI c97hN/xd3dS/v4LZyE76kHxjTRCyDCKzszF/9iW+0O9mOnSc/FxLfgm8dl86xZVH fpj2fx/8IWDWMXLANANVigXDxjJWZjSyDCZOutbY1Q/0/mUIg/CLZq18HCKjEOKu LmK0V7ayRPop1HFuNFueWA7nOTrKYNjkuZia2uxIE4s+5i7BsiyTgWjxYZHP3LtO 9JefecsNpwc= =QzsY -----END PGP SIGNATURE----- From fabbione at fabbione.net Thu Apr 14 14:45:02 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Thu, 14 Apr 2005 16:45:02 +0200 (CEST) Subject: [Linux-cluster] [PATCH] Fix dlm-kernel build with 2.6.12rc2 Message-ID: <20050414144502.C75AF2A8C@trider-g7.fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi everybody, the following patch fixes compilation of nodes.c with 2.6.12rc2. A macro called nodes_clear has been recently introduced. This leads to a clash. I renamed the DLM one to nodes_nodes_clear only to solve the problem, but of course my patch isn't authoritative. 
Feel free to rename it as you wish :) Signed-off-by: Fabio Massimo Di Nitto Index: dlm-kernel/src/nodes.c =================================================================== RCS file: /cvs/cluster/cluster/dlm-kernel/src/nodes.c,v retrieving revision 1.12 diff -u -r1.12 nodes.c - --- dlm-kernel/src/nodes.c 27 Jan 2005 09:23:45 -0000 1.12 +++ dlm-kernel/src/nodes.c 14 Apr 2005 14:28:08 -0000 @@ -277,7 +277,7 @@ return error; } - -static void nodes_clear(struct list_head *head) +static void nodes_nodes_clear(struct list_head *head) { struct dlm_csb *csb; @@ -290,13 +290,13 @@ void ls_nodes_clear(struct dlm_ls *ls) { - - nodes_clear(&ls->ls_nodes); + nodes_nodes_clear(&ls->ls_nodes); ls->ls_num_nodes = 0; } void ls_nodes_gone_clear(struct dlm_ls *ls) { - - nodes_clear(&ls->ls_nodes_gone); + nodes_nodes_clear(&ls->ls_nodes_gone); } int ls_nodes_init(struct dlm_ls *ls, struct dlm_recover *rv) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) iQIVAwUBQl6BQlA6oBJjVJ+OAQKQYg/+OsUuKjv3gS3T0+c30QNP1hWGBY/QP250 q3stoxo1Nt8NSRmDI2CXuHpCMah5XVPmyf13nMDj2m60VrGi51Jyrt9PDbYhYbXA W1weCsZKf7aD1SVnJ6Dauebj3eU/PPU53n1M8+6RrRqkVLnXW+MqIcEt/SAKO55V uovdTYrZrXJYdR6uU49Ss1qdTGQP6QbLzf8ivXPzl44GlWHhYMSc4WvV3RxRslxI n/BUhCL2MJdhX26r8w5dgL2VG5bnz3x8S8T37Eh8Bs39VAwuleW5Cfpr4MV4Yhuw WTrkIF56BFSiqDvvLuvhVVZlAFKvqeU6Xp9kVMUMAOJ7tjTwwwbX+TTUL/YAi7Yp UqaMk9yAhgiVsBK+/0AqHJYx1mGfZhgbQ1A3Wr0uIADDsoHI4OP3ZsaYnKlX04rm 1JoxF1I4nHg0hlgGJHLCaGTTIRuVKzIvutFFZL9oWj6fp3mxw6MYo7nWxPxjMAhT hCfo8YjoGlpUVlnBgOSRCVUNaCXwcxNWgRyoVfn8yX+vufTQZxUCzX5SkiFcMDn1 uP8sUFvNoSvNYhzutd4Ma5pb2I+Qu4zVxKFRQ7rBSOBFn9UI4klS5Eu9JlMAzd5f fmDK4lXDM/9yjjqSbQTNAV2gSkrwwtxfu2DSGXja/Xkh5MtPdv2OEbXGLLkTiZ8g u/FUdUCytd8= =pKJ1 -----END PGP SIGNATURE----- From fabbione at fabbione.net Thu Apr 14 14:45:19 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Thu, 14 Apr 2005 16:45:19 +0200 (CEST) Subject: [Linux-cluster] [PATCH] Fix gnbd-kernel build with 2.6.12rc2 Message-ID: <20050414144519.546922A8C@trider-g7.fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi everybody, the following patch fixes compilation of gnbd.c with 2.6.12rc2. The i_sock has been recently removed from the inode structure (change happened in the kernel tree the 1st of April) and made part of i_mode. Please apply. 
Signed-off-by: Fabio Massimo Di Nitto Index: gnbd-kernel/src/gnbd.c =================================================================== RCS file: /cvs/cluster/cluster/gnbd-kernel/src/gnbd.c,v retrieving revision 1.7 diff -u -r1.7 gnbd.c - --- gnbd-kernel/src/gnbd.c 7 Apr 2005 16:19:37 -0000 1.7 +++ gnbd-kernel/src/gnbd.c 14 Apr 2005 14:30:29 -0000 @@ -735,7 +735,7 @@ if (!file) return error; inode = file->f_dentry->d_inode; - - if (!inode->i_sock) { + if (!S_ISSOCK(inode->i_mode)) { fput(file); return error; } -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) iQIVAwUBQl6BTVA6oBJjVJ+OAQIU7BAAnea52QS9ISXWHXWrrqeEaqFVbm1bSs1A +BKMycDiSsDKwttb+/bma2V56gjdqnv7//11wv2IiG5lt1q1HebgVTM+ecPMCRBb 6VsJV2NB+HgjRtcNkbAiw7hLVpcG+WFe5VaFVSsG20B5I47n9ahkF0a8umY4zSbd O1pCBJA3H4QMiTwlNA8kEj5EBdc3/jB4KCYGwGNhR7m61etZ4JMiEdGlOeQwYMK1 4DcXpCgo8aBLACUHGST2e3mnq48ztHHMNI7M0H8BLNrUbhm1EtIEtzyXqJjrS7ku TNZKKyfjlioAJk4B718ValMMEifZtlxwjlT3FEYfEd7/MUA2sw6ET4arFbDKcGjU Bn5wdFdoVDZpDwhWICfQq2rVleBydNGCyZ4HYMcI3WBi3RKH21zrLnt5YqL9EA/9 9TC8PhD24i8+9rp/kmRV3QtWJtooEO2VSfGKJSDXHoeKkt8S2RTByxuBo5UpBMkI z/+lB8zlDyF+qvn3TtkaTuJC8fk3clrkQfT+jiI4/7ZztK37NgcCF9Qe1rac3QS4 VFRTrYJD8hcAOMa40HHCdZTyezetE4N/m6SDOJ+Pps+2KTWYxkJguas0+Aua5yeP jyyAV3vmKMmPewbNknw1gHoPTI4pz1QUZ89E3hhnmM1Zoi6y4CMzq1ndv/ZqAROx cS4j9lsnd60= =+YaG -----END PGP SIGNATURE----- From CAugustine at overlandstorage.com Thu Apr 14 18:22:15 2005 From: CAugustine at overlandstorage.com (CAugustine at overlandstorage.com) Date: Thu, 14 Apr 2005 11:22:15 -0700 Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 12, Issue 9 Message-ID: Hi Everyone, I have tried to down load the cluster software by running: cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs and providing the password=cvs. Unfortunately, the connection to sources.redhat.com (12. 107.209.250):2401 faileds with a "connection time out" messages. I am not sure if there is a problem at redhat or locally at my site... Any suggestions? Thanks, Caroline ---------------------------------------------------------------------------------------------- See our award-winning line of tape and disk-based backup & recovery solutions at http://www.overlandstorage.com ---------------------------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From rstevens at vitalstream.com Thu Apr 14 19:26:30 2005 From: rstevens at vitalstream.com (Rick Stevens) Date: Thu, 14 Apr 2005 12:26:30 -0700 Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 12, Issue 9 In-Reply-To: References: Message-ID: <425EC3E6.8040608@vitalstream.com> CAugustine at overlandstorage.com wrote: > > Hi Everyone, > > I have tried to down load the cluster software by running: > > cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs > > and providing the password=cvs. Unfortunately, the connection to > sources.redhat.com (12. > 107.209.250):2401 faileds with a "connection time out" messages. I am > not sure if there is > a problem at redhat or locally at my site... Your firewall probably blocks TCP/UDP port 2401. CVS :pserver: operations use that port. Poke a hole in your firewall to allow incoming data on both TCP port 2401 and UDP port 2401. ---------------------------------------------------------------------- - Rick Stevens, Senior Systems Engineer rstevens at vitalstream.com - - VitalStream, Inc. http://www.vitalstream.com - - - - If at first you don't succeed, quit. No sense being a damned fool! 
- ---------------------------------------------------------------------- From pcaulfie at redhat.com Fri Apr 15 08:08:59 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Fri, 15 Apr 2005 09:08:59 +0100 Subject: [Linux-cluster] [PATCH] Fix dlm-kernel build with 2.6.12rc2 In-Reply-To: <20050414144502.C75AF2A8C@trider-g7.fabbione.net> References: <20050414144502.C75AF2A8C@trider-g7.fabbione.net> Message-ID: <20050415080859.GA23730@tykepenguin.com> On Thu, Apr 14, 2005 at 04:45:02PM +0200, Fabio Massimo Di Nitto wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi everybody, > > the following patch fixes compilation of nodes.c with 2.6.12rc2. Thanks for those. I'll apply them to CVS head when 2.6.12 is released. -- patrick From fabbione at fabbione.net Fri Apr 15 10:52:44 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Fri, 15 Apr 2005 12:52:44 +0200 Subject: [Linux-cluster] [PATCH] Fix dlm-kernel build with 2.6.12rc2 In-Reply-To: <20050415080859.GA23730@tykepenguin.com> References: <20050414144502.C75AF2A8C@trider-g7.fabbione.net> <20050415080859.GA23730@tykepenguin.com> Message-ID: <425F9CFC.1000901@fabbione.net> Patrick Caulfield wrote: > On Thu, Apr 14, 2005 at 04:45:02PM +0200, Fabio Massimo Di Nitto wrote: > >>-----BEGIN PGP SIGNED MESSAGE----- >>Hash: SHA1 >> >>Hi everybody, >> >>the following patch fixes compilation of nodes.c with 2.6.12rc2. > > > Thanks for those. I'll apply them to CVS head when 2.6.12 is released. Welcome :) Fabio From mrc at linuxplatform.org Fri Apr 15 13:19:48 2005 From: mrc at linuxplatform.org (Matt) Date: Fri, 15 Apr 2005 09:19:48 -0400 Subject: [Linux-cluster] DB Clustering Question Message-ID: <1113571189.6839.6.camel@althea.playway.net> Hi everyone, I'm new to this list. I'm researching database cluster solutions and I'm not really finding what I'm looking for. What I really want to do is parallel processing with mySQL or Postgresql. If I can't do that, then simply having multiple SQL servers share the same DB files is the next option. Can anyone push me in the right direction? One last question, does anyone have any experience with the Ingres database and its clustering features? -- Matt From gwood at dragonhold.org Fri Apr 15 13:33:48 2005 From: gwood at dragonhold.org (gwood at dragonhold.org) Date: Fri, 15 Apr 2005 14:33:48 +0100 (BST) Subject: [Linux-cluster] DB Clustering Question In-Reply-To: <1113571189.6839.6.camel@althea.playway.net> References: <1113571189.6839.6.camel@althea.playway.net> Message-ID: <18349.198.96.134.61.1113572028.squirrel@198.96.134.61> > What I really want to do is parallel processing with mySQL or > Postgresql. This needs support at the DB level. MySQL has a version that requires the DB to be smaller than the amount of available memory - since the DB gets kept in RAM of all the clustered servers (not sure if it does more than 2, I can't afford that much memory *grin*). Not sure about postgresql, you'll have to check their website for cluster options. If you're willing to spend money on applications, there is a 3rd party addon for MySQL that does it, but it's been a while since I looked at it, so I don't know the details. From memory, it intercepts queries at the TCP/IP stack layer - all the machines have the same MAC for a virtual server as well as sharing state data (even to the TCP/IP layer), and therefore can take over any existing connections for running servers. It does mean that you have all the machines handling all the incoming network traffic which is less than ideal. 
> If I can't do that, then simply having multiple SQL servers > share the same DB files is the next option. Can anyone push me in the > right direction? This won't work, at least not without application layer support. In the same way that you need GFS to get multiple machines to use the same filesystem, you'd need a similar level of support for locking & caching within the database. I think that Oracle have it (for their RAC product) and probably some others too, but I don't know of anything similar in MySQL at least. If your usage is skewed to reads rather than writes, then you could probably do something with replication, but there are details on that on the various websites too. Hope this helps some, Graham From chrisd at pearsoncmg.com Fri Apr 15 14:19:42 2005 From: chrisd at pearsoncmg.com (Chris Darroch) Date: Fri, 15 Apr 2005 10:19:42 -0400 Subject: [Linux-cluster] DB Clustering Question In-Reply-To: <1113571189.6839.6.camel@althea.playway.net> References: <1113571189.6839.6.camel@althea.playway.net> Message-ID: <425FCD7E.3000705@pearsoncmg.com> Matt wrote: > What I really want to do is parallel processing with mySQL or > Postgresql. If I can't do that, then simply having multiple SQL servers > share the same DB files is the next option. Can anyone push me in the > right direction? I'm new to the list as well, but having just gone through the process of evaluating exactly this kind of problem, I have a few cents I can throw in. I think the very short answer to your question is that databases and multiple servers don't mix well at all, as a general rule, and if you need full transactional SQL support in a cluster, you're likely looking at a commerical solution. The fundamental problem is that transactional databases, of which SQL databases are a subset, need to ensure that all transactions occur atomically, and to do this, they need very robust, very fast locking subsystems. For example, before updating a row in a table, a database process needs to be sure that it acquires a lock on that data first, so that other database processes handling other client requests don't read partially altered data. Now locking is hard enough to do when you have just one machine (either single CPU or multiple CPUs), but can be done quite effectively and efficiently through the use of in-memory mutexes and other such devices. Oracle, for example, takes out a big chunk of shared memory, which all processes use to coordinate locking. Doing this in a cluster of machines is much, much more difficult. It's compounded by the problem that one or more machines could fail, or the network could fail in various ways, and the DB software must ensure that under no conditions does the data become corrupted. (See all the work involved in the GFS DLM, for example, involving handling "split brain" conditions and the such like.) Oracle RAC (Real Application Cluster) provides this functionality at considerable expense, for instance, by requiring that you have a high-speed interconnection network between your machines, and then by providing its own internal lock manager and cluster monitor and so forth. Essentialy, many of the components of GFS are provided inside Oracle RAC, for its own purposes, but are unavailable to outside processes. 
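To make the single-machine case above concrete, here is a small illustrative C sketch (nothing to do with Oracle's actual internals; the shared-memory name is invented for the example) of cooperating processes on one host coordinating through a mutex placed in shared memory. The point is that this mechanism stops at the machine boundary, which is exactly why a clustered database needs a distributed lock manager instead:

/* shmlock.c - illustrative sketch only, not any database's real design.
 * Shows the creating process; a second process on the same host would
 * shm_open() the same name, mmap() it, and skip the init. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = shm_open("/demo_db_lock", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("shm_open");
                return 1;
        }
        if (ftruncate(fd, sizeof(pthread_mutex_t)) < 0) {
                perror("ftruncate");
                return 1;
        }

        pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (m == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* PTHREAD_PROCESS_SHARED lets unrelated processes on the same host
         * share the mutex: the in-memory coordination described above. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);

        pthread_mutex_lock(m);
        /* ... update the row or shared structure here ... */
        pthread_mutex_unlock(m);

        munmap(m, sizeof(*m));
        close(fd);
        return 0;
}

This compiles with gcc -pthread (plus -lrt on older glibc). A second process on the same machine can open the segment and contend for the mutex, but a process on another node cannot, short of the kind of distributed lock manager discussed on this list.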
You can also run Oracle RAC on Linux, in various ways: http://www.redhat.com/software/rha/gfs/ http://www.oracle.com/technology/tech/linux/index.html http://www.veritas.com/van/articles/7655.jsp If I understand the RedHat option correctly, Oracle relies on GFS to manage the shared storage in the cluster, but still uses its own lock manager, cluster monitor, etc., for its own internal cache management and transaction handling. However, I haven't read the installation white paper, so I'm not sure about that. (Note to RedHat folks: trying to register on the Web site leads to an access denied error for the /info/ page.) Open source SQL databases like PostgreSQL and MySQL just don't have this kind of feature, so far as I can determine. MySQL provides a cluster mechanism over regular TCP, but as far as I could tell from the documentation, this works by keeping the entire database in RAM on each cluster node: http://dev.mysql.com/doc/mysql/en/multi-hardware-software-network.html PostgreSQL can be run in a cluster by emulating a single operating system underneath it, using high-speed interconnections and special kernel modifications: http://www.linuxlabs.com/clusgres.html I don't know much about Ingres myself, but I didn't see anything about clustering for that, either. It's perhaps worth noting that PostgreSQL and Oracle face special complexities regarding data consistency and locking because they provide MVCC (Multi-Version Concurrency Control), which means that each database client sees a "snapshot" of the entire database as it was when they began their transaction. As long as their transaction remains active, the database retains previous versions of all data modified by all other active transactions, so that the snapshot remains accurate to a past point in time. Only once the transaction has closed can the database clean up old versions of data. This is subtly different from just providing row-level locking in a table; if one transaction is slowing reading through all the rows of a table while another one performs updates of selected rows, the old versions of the updated rows are kept around until the reader's transaction closes, in case they are needed to provide an accurate view of what the data in the table looked like when the reader's transaction began. So that's all just to say that the business of locking and shuffling data around is especially complex for such databases, and doing it in a cluster even more so. What you are able to do with the available options depends partly on your requirements, obviously. If you don't mind having multiple read-only copies of your database files, and allowing them to be somewhat out of date, there are various ways you could replicate your data files from a master read-write node to multiple read-only nodes. You'd want to ensure that the copying process performed the necessary interactions with the master database to ensure that it never copied partially complete data files; performing a hot backup and then replicating those files to the read-only nodes would work. Another related option if you don't mind having read-only and slightly out-of-date copies is to use memcached: http://www.danga.com/memcached/ This functions as a data cache between your client programs and the database, and spreads the data around to multiple machines. But obviously write requests need to go to the master database, and then be replicated to the caches, and there's a period of time when you might not read up-to-date data from the cache. 
But this may be OK for your application. If you need true full transactional SQL support spread across a cluster, I believe you'll have to look at Oracle or another commerical solution like the ClusGres one I referenced above. I'd love to stand correctly, though, if anyone knows more about this. Chris. -- GPG Key ID: 366A375B GPG Key Fingerprint: 485E 5041 17E1 E2BB C263 E4DE C8E3 FA36 366A 375B From Hansjoerg.Maurer at dlr.de Fri Apr 15 14:24:21 2005 From: Hansjoerg.Maurer at dlr.de (Hansjoerg.Maurer at dlr.de) Date: Fri, 15 Apr 2005 16:24:21 +0200 Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible solution Message-ID: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de> Hi I found a solution for the problem descriped below, but I am not sure if it is the right way. - importing the two gnbd's (wich point to the same device) from two servers -> /dev/gnbd0 and /dev/gnbd1 on the client - creating a multipath device with something like this: echo "0 167772160 multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0 (251:0 ist the major:minor id of /dev/gnbd0) - mounting the created device eg: mount -t gfs /dev/mapper/dm0 /mnt/lvol0 If I do a write on /mnt/lvol0 the gnbd_server task on both gnbd_servers start (with a noticeable speedup) If one gnbd_server fails dm removes that path with the following log kernel: device-mapper: dm-multipath: Failing path 251:0. I was able to add it again with dmsetup message dm0 0 reinstate_path 251:0 I was able to deactivate a path manually with dmsetup message dm0 0 fail_path 251:0 But I can not unimport the underlying gnbd gnbd_import: ERROR cannot disconnect device #1 : Device or resource busy Is there a way to remove a gnbd, which is bunndled in a dm-multipath device? (might be necessary, if one gnbd server must be rebooted) How can I reimport an gnbd on the client in state disconnected? (I had to manually start gnbd_recvd -d 0 to do so) Is the descriped solution for gnbd multipath the right one? Thank you very much Greetings from munich Hansj?rg >Hi > >I am trying to set up gnbd with multipath. >Accoding to the gnbd_usage.txt file, I understand, that this should work with >dm-multipath. >But unfortunatly only the gfs part of the setup is descriped there. > >Has anybody experiance with this setup, especially how to set up >multipath with multiple /dev/gnbd* and how to setup the multipath.conf file > > >Thank you very much > >Hansj?rg Maurer -- _________________________________________________________________ Dr. Hansjoerg Maurer | LAN- & System-Manager | Deutsches Zentrum | DLR Oberpfaffenhofen f. Luft- und Raumfahrt e.V. | Institut f. Robotik | Postfach 1116 | Muenchner Strasse 20 82230 Wessling | 82234 Wessling Germany | | Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ __________________________________________________________________ There are 10 types of people in this world, those who understand binary and those who don't. From teigland at redhat.com Fri Apr 15 14:39:53 2005 From: teigland at redhat.com (David Teigland) Date: Fri, 15 Apr 2005 22:39:53 +0800 Subject: [Linux-cluster] DB Clustering Question In-Reply-To: <1113571189.6839.6.camel@althea.playway.net> References: <1113571189.6839.6.camel@althea.playway.net> Message-ID: <20050415143953.GB11756@redhat.com> On Fri, Apr 15, 2005 at 09:19:48AM -0400, Matt wrote: > What I really want to do is parallel processing with mySQL or > Postgresql. 
If I can't do that, then simply having multiple SQL servers > share the same DB files is the next option. Can anyone push me in the > right direction? > > One last question, does anyone have any experience with the Ingres > database and its clustering features? I believe Ingres is the only cluster database that's open source. When I looked a few months ago there was some work needed to hook it into our cluster/lock managers, but that didn't look too bad as they were already able to switch between different clustering/locking infrastrutures. -- Dave Teigland From lhh at redhat.com Fri Apr 15 14:59:27 2005 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 15 Apr 2005 10:59:27 -0400 Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs In-Reply-To: <425E2180.6060609@birger.sh> References: <425E2180.6060609@birger.sh> Message-ID: <1113577167.20618.155.camel@ayanami.boston.redhat.com> On Thu, 2005-04-14 at 09:53 +0200, birger wrote: > - Mount the disks permanently on both nodes using gfs (less chance of > nuking the file systems because of a split-brain) The way GFS forcefully prevents I/O (which protects data!) is via "fencing" (fibre channel zoning or a remote power controller/integrated power control, etc). This prevents the block I/Os from hitting the disks for a node which has died, and works with any file system (not just GFS). Fencing is required in order for CMAN to operate in any useful capacity in 2-node mode. Anyway, to make this short: You probably want fencing for your solution. > - Perhaps also run NFS services permanently on both nodes, failing over > only the IP address of the official NFS service. Should make failover > even faster, but are there pitfalls to running multiple NFS servers off > the same gfs file system? In addition to failing over the IP address, I > would have to look into how to take along NFS file locks when doing a > takeover. With GFS, the file locking should just kind of "work", but the client would be required to fail over. I don't think the Linux NFS client can do this, but I believe the Solaris one can... (correct me if I'm wrong here). Failing over just an IP may work, but there may be some issues as well. In any case, we should certainly *make* it work if it doesn't at the moment, eh? :) With a pure NFS failover solution (ex: on ext3, w/o replicated cluster locks), there needs to be some changes to nfsd, lockd, and rpc.statd in order to make lock failover work seamlessly. > Can anyone 'talk me through' the steps needed to get this up and running? Well, there's a start of the issues. You can use rgmanager to do the IP and Samba failover. Take a look at "rgmanager/src/daemons/tests/*.conf". I don't know how well Samba failover has been tested. -- Lon From chrisd at pearsoncmg.com Fri Apr 15 16:31:01 2005 From: chrisd at pearsoncmg.com (Chris Darroch) Date: Fri, 15 Apr 2005 12:31:01 -0400 Subject: [Linux-cluster] DB Clustering Question In-Reply-To: <20050415143953.GB11756@redhat.com> References: <1113571189.6839.6.camel@althea.playway.net> <20050415143953.GB11756@redhat.com> Message-ID: <425FEC45.8060302@pearsoncmg.com> Hi -- David Teigland wrote: > I believe Ingres is the only cluster database that's open source. When I > looked a few months ago there was some work needed to hook it into our > cluster/lock managers, but that didn't look too bad as they were already > able to switch between different clustering/locking infrastrutures. That's very interesting -- I should have looked more closely at Ingres R3! 
:-) Naturally, their site is mostly down now that I want to look at it. Seems like they adopted OpenDLM last year. I can't quite tell, but if the Ingres Grid Option is their "single DB clustering" option, it seems to not support things like row-level locks and "update mode locks". (The Distributed Option appears to be a DTP solution for heterogeneous DBs, and the Replicator Option one for replicating between Ingres DBs, both based on two-phase commits. It looks like you turn of two-phase commits when using the Replicator and Grid Options together.) Errors are mine due to overly quick scanning of documents. I wrote: > It's perhaps worth noting that PostgreSQL and Oracle face special > complexities regarding data consistency and locking because they > provide MVCC (Multi-Version Concurrency Control) ... One small pointless correction to my own tangent is that Oracle actually calls their version MVRC and it works a little differently than PostgreSQL's, but the rough idea is the same. Chris. -- GPG Key ID: 366A375B GPG Key Fingerprint: 485E 5041 17E1 E2BB C263 E4DE C8E3 FA36 366A 375B From fedora at nodata.co.uk Fri Apr 15 16:46:09 2005 From: fedora at nodata.co.uk (nodata) Date: Fri, 15 Apr 2005 18:46:09 +0200 Subject: [Linux-cluster] DB Clustering Question In-Reply-To: <425FCD7E.3000705@pearsoncmg.com> References: <1113571189.6839.6.camel@althea.playway.net> <425FCD7E.3000705@pearsoncmg.com> Message-ID: <1113583569.3224.13.camel@sb-home.lan> On Fri, 2005-04-15 at 10:19 -0400, Chris Darroch wrote: > Matt wrote: > > > What I really want to do is parallel processing with mySQL or > > Postgresql. If I can't do that, then simply having multiple SQL servers > > share the same DB files is the next option. Can anyone push me in the > > right direction? > > I'm new to the list as well, but having just gone through the process > of evaluating exactly this kind of problem, I have a few cents I can > throw in. > > I think the very short answer to your question is that databases and > multiple servers don't mix well at all, as a general rule, and if you > need full transactional SQL support in a cluster, you're likely looking > at a commerical solution. > > The fundamental problem is that transactional databases, of which > SQL databases are a subset, need to ensure that all transactions occur > atomically, and to do this, they need very robust, very fast locking > subsystems. For example, before updating a row in a table, a > database process needs to be sure that it acquires a lock on that data > first, so that other database processes handling other client requests > don't read partially altered data. > > Now locking is hard enough to do when you have just one machine > (either single CPU or multiple CPUs), but can be done quite effectively > and efficiently through the use of in-memory mutexes and other such > devices. Oracle, for example, takes out a big chunk of shared memory, > which all processes use to coordinate locking. > > Doing this in a cluster of machines is much, much more difficult. > It's compounded by the problem that one or more machines could fail, > or the network could fail in various ways, and the DB software must > ensure that under no conditions does the data become corrupted. > (See all the work involved in the GFS DLM, for example, involving > handling "split brain" conditions and the such like.) 
> > Oracle RAC (Real Application Cluster) provides this functionality > at considerable expense, for instance, by requiring that you have > a high-speed interconnection network between your machines, and > then by providing its own internal lock manager and cluster monitor > and so forth. Essentialy, many of the components of GFS are > provided inside Oracle RAC, for its own purposes, but are unavailable > to outside processes. You can also run Oracle RAC on Linux, in > various ways: > > http://www.redhat.com/software/rha/gfs/ > http://www.oracle.com/technology/tech/linux/index.html > http://www.veritas.com/van/articles/7655.jsp > > If I understand the RedHat option correctly, Oracle relies on GFS > to manage the shared storage in the cluster, but still uses its own > lock manager, cluster monitor, etc., for its own internal cache > management and transaction handling. However, I haven't read the > installation white paper, so I'm not sure about that. (Note to > RedHat folks: trying to register on the Web site leads to an > access denied error for the /info/ page.) > > Open source SQL databases like PostgreSQL and MySQL just don't > have this kind of feature, so far as I can determine. MySQL provides a > cluster mechanism over regular TCP, but as far as I could tell from the > documentation, this works by keeping the entire database in RAM on each > cluster node: > > http://dev.mysql.com/doc/mysql/en/multi-hardware-software-network.html > > PostgreSQL can be run in a cluster by emulating a single operating > system underneath it, using high-speed interconnections and special > kernel modifications: > > http://www.linuxlabs.com/clusgres.html > > I don't know much about Ingres myself, but I didn't see anything > about clustering for that, either. > > It's perhaps worth noting that PostgreSQL and Oracle face special > complexities regarding data consistency and locking because they > provide MVCC (Multi-Version Concurrency Control), which means that > each database client sees a "snapshot" of the entire database > as it was when they began their transaction. As long as their > transaction remains active, the database retains previous versions > of all data modified by all other active transactions, so that > the snapshot remains accurate to a past point in time. Only once > the transaction has closed can the database clean up old versions of > data. This is subtly different from just providing row-level locking > in a table; if one transaction is slowing reading through all the > rows of a table while another one performs updates of selected rows, > the old versions of the updated rows are kept around until the > reader's transaction closes, in case they are needed to provide an > accurate view of what the data in the table looked like when the > reader's transaction began. So that's all just to say that the > business of locking and shuffling data around is especially complex > for such databases, and doing it in a cluster even more so. > > What you are able to do with the available options depends partly > on your requirements, obviously. If you don't mind having multiple > read-only copies of your database files, and allowing them to be > somewhat out of date, there are various ways you could replicate > your data files from a master read-write node to multiple read-only > nodes. 
You'd want to ensure that the copying process performed > the necessary interactions with the master database to ensure that > it never copied partially complete data files; performing a hot > backup and then replicating those files to the read-only nodes would > work. > > Another related option if you don't mind having read-only and > slightly out-of-date copies is to use memcached: > > http://www.danga.com/memcached/ > > This functions as a data cache between your client programs and > the database, and spreads the data around to multiple machines. > But obviously write requests need to go to the master database, > and then be replicated to the caches, and there's a period of time > when you might not read up-to-date data from the cache. But this > may be OK for your application. > > If you need true full transactional SQL support spread across a > cluster, I believe you'll have to look at Oracle or another commerical > solution like the ClusGres one I referenced above. I'd love to > stand correctly, though, if anyone knows more about this. > > Chris. > If the database is mainly used for reads, you should check out emic networks' product. It will allow you to cluster mysql across multiple boxes, and if a node fails, it doesn't matter. If you want to add more boxes you can too. It's load balanced, and writes are atomic across the cluster. Interestingly, it does NOT require an in-RAM database. See http://www.emicnetworks.com/ From srinisan at fmailbox.com Fri Apr 15 17:24:44 2005 From: srinisan at fmailbox.com (Srini Sankaran) Date: Fri, 15 Apr 2005 10:24:44 -0700 Subject: [Linux-cluster] Can LOCK_NOLOCK be used in this situation? Message-ID: <37c450ffae096dcee27edda8e07e5d7a@fmailbox.com> I don't have a GFS cluster running right now. I'd appreciate some guidance on using GFS for the following situation: I need a cluster of nodes to have read and write access to a scalable and common pool of storage connected to an FC SAN. The entire pool of storage must appear as one single file system to the cluster nodes. So far so good. GFS fits. But... The application running on each node is partitioned in such a way that at any given moment, only one node will need read / write access to a directory and its descendant file tree. For example, let's say the file system is called "/big" and it has directories "a", "b", ... "z". Let's say that I have cluster node "1", "2", and "3". When node 1 needs access to "/big/a", the other nodes "2" and "3", won't need access to "/big/a". Those nodes will be reading and writing in to "/big/b" or "/big/c" or something else. In general, "/big/a" and other directories could have several million files. A few minutes to hours later, node 2 might take over the read / write responsibilities for "/big/a", and node 1 might move over to "/big/b", etc. From reading the GFS documentation, it certainly appears that a standard GFS with locking (single or redundant servers) would work in this situation. But, I would like to avoid designating any single or multiple servers as lock servers. This is because the cluster is very dynamic. Nodes can constantly be added or removed, and the system administration environment isn't conducive for designating lock servers and protecting them. Besides, I am wondering why the lock servers should work so hard to maintain all the locks on the millions of files when I know 100% that no other node is going to access the files simultaneously. So, my question is: Can I simply use LOCK_NOLOCK in this situation and avoid any lock server? 
Maybe the answer is no because the documentation warns "Do not allow multiple nodes to mount the same file system while LOCK_NOLOCK is used. Doing so causes one or more nodes to panic their kernels, and may cause file system corruption". I am still asking the question because of this partitioned file system access characteristic of my application. Is this warning still valid if I can guarantee that no two files or directories will be accessed by two different nodes simultaneously? If I can't do LOCK_NOLOCK, is there any other idea I can use here? Thanks for your time From kpreslan at redhat.com Fri Apr 15 18:02:10 2005 From: kpreslan at redhat.com (Ken Preslan) Date: Fri, 15 Apr 2005 13:02:10 -0500 Subject: [Linux-cluster] Can LOCK_NOLOCK be used in this situation? In-Reply-To: <37c450ffae096dcee27edda8e07e5d7a@fmailbox.com> References: <37c450ffae096dcee27edda8e07e5d7a@fmailbox.com> Message-ID: <20050415180210.GA19976@potassium.msp.redhat.com> On Fri, Apr 15, 2005 at 10:24:44AM -0700, Srini Sankaran wrote: ... > A few minutes to hours later, node 2 might take over the read / write > responsibilities for "/big/a", and node 1 might move over to "/big/b", > etc. ... > So, my question is: Can I simply use LOCK_NOLOCK in this situation and > avoid any lock server? Maybe the answer is no because the documentation > warns "Do not allow multiple nodes to mount the same file system while > LOCK_NOLOCK is used. Doing so causes one or more nodes to panic their > kernels, and may cause file system corruption". > > I am still asking the question because of this partitioned file system > access characteristic of my application. Is this warning still valid if > I can guarantee that no two files or directories will be accessed by > two different nodes simultaneously? If I can't do LOCK_NOLOCK, is there > any other idea I can use here? Nolock won't work here. Even if the directory tree is partitioned between nodes, the allocation bitmaps aren't. Allocate enough and you'll see contention there. And without locking, you'll see corruption there too. You also need locking to manage the transitions when a machine switches directories. Caches need to be flushed and invalidated. The locking makes that happen. If you're reluctance to use locking is just because you don't want dedicated GULM lock servers, you might want to try the DLM instead. -- Ken Preslan From fabbione at fabbione.net Sat Apr 16 08:33:44 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Sat, 16 Apr 2005 10:33:44 +0200 (CEST) Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel (2.6.12rc2) Message-ID: <20050416083344.1BCE02B9F@trider-g7.fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi everybody, the 26th of March 2005 Arnaldo Carvalho de Melo commited a quite big change to sk_alloc: ChangeSet 1.2181.42.2 2005/03/26 20:04:49 acme at toy.ghostprotocols.net [NET] make all protos partially use sk_prot sk_alloc_slab becomes proto_register, that receives a struct proto not necessarily completely filled, but at least with the proto name, owner and obj_size (aka proto specific sock size), with this we can remove the struct sock sk_owner and sk_slab, using sk->sk_prot->{owner,slab} instead. This patch also makes sk_set_owner not necessary anymore, as at sk_alloc time we have now access to the struct proto onwer and slab members, so we can bump the module refcount exactly at sock allocation time. 
Another nice "side effect" is that this patch removes the generic sk_cachep slab cache, making the only last two protocols that used it use just kmalloc, informing a struct proto obj_size equal to sizeof(struct sock). Ah, almost forgot that with this patch it is very easy to use a slab cache, as it is now created at proto_register time, and all protocols need to use proto_register, so its just a matter of switching the second parameter of proto_register to '1', heck, this can be done even at module load time with some small additional patch. Another optimization that will be possible in the future is to move the sk_protocol and sk_type struct sock members to struct proto, but this has to wait for all protocols to move completely to sk_prot. This changeset also introduces /proc/net/protocols, that lists the registered protocols details, some may seem excessive, but I'd like to keep them while working on further struct sock hierarchy work and also to realize which protocols are old ones, i.e. that still use struct proto_ops, etc, yeah, this is a bit of an exaggeration, as all protos still use struct proto_ops, but in time the idea is to move all to use sk->sk_prot and make the proto_ops infrastructure be shared among all protos, reducing one level of indirection. Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: David S. Miller The same change needs to be propagated to cman-kernel (probably more, but i am working on one module at a time). Here is a preliminary patch that works for me. Please review before applying. Signed-off-by: Fabio M. Di Nitto Index: cnxman.c =================================================================== RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v retrieving revision 1.55 diff -u -r1.55 cnxman.c - --- cnxman.c 5 Apr 2005 13:43:09 -0000 1.55 +++ cnxman.c 16 Apr 2005 08:20:42 -0000 @@ -66,8 +66,8 @@ extern void cman_set_realtime(struct task_struct *tsk, int prio); static struct proto_ops cl_proto_ops; +static struct proto cl_proto; static struct sock *master_sock; - -static kmem_cache_t *cluster_sk_cachep; /* Pointer to the pseudo node that maintains quorum in a 2node system */ struct cluster_node *quorum_device = NULL; @@ -918,14 +918,14 @@ return; } - -static struct sock *cl_alloc_sock(struct socket *sock, int gfp) +static struct sock *cl_alloc_sock(struct socket *sock, int gfp, int protocol) { struct sock *sk; struct cluster_sock *c; if ((sk = - - sk_alloc(AF_CLUSTER, gfp, sizeof (struct cluster_sock), - - cluster_sk_cachep)) == NULL) + sk_alloc(AF_CLUSTER, gpf, &cl_proto, + 1)) == NULL) goto no_sock; if (sock) { @@ -937,6 +937,7 @@ sk->sk_no_check = 1; sk->sk_family = PF_CLUSTER; sk->sk_allocation = gfp; + sk->sk_protocol = protocol; c = cluster_sk(sk); c->port = 0; @@ -1031,7 +1032,7 @@ if (!atomic_read(&cnxman_running) && protocol != CLPROTO_MASTER) return -ENETDOWN; - - if ((sk = cl_alloc_sock(sock, GFP_KERNEL)) == NULL) + if ((sk = cl_alloc_sock(sock, GFP_KERNEL, protocol)) == NULL) return -ENOBUFS; sk->sk_protocol = protocol; @@ -4155,6 +4156,12 @@ .owner = THIS_MODULE, }; +static struct proto cl_proto = { + .name = "CMAN", + .owner = THIS_MODULE, + .obj_size = sizeof(struct cluster_sock) +}; + #ifdef MODULE MODULE_DESCRIPTION("Cluster Connection and Service Manager"); MODULE_AUTHOR("Red Hat, Inc"); @@ -4166,19 +4173,14 @@ printk("CMAN %s (built %s %s) installed\n", CMAN_RELEASE_NAME, __DATE__, __TIME__); - - if (sock_register(&cl_family_ops)) { - - printk(KERN_INFO "Unable to register cluster socket type\n"); + if 
(proto_register(&cl_proto,0) < 0) { + printk(KERN_INFO "Unable to register cluster protocol type\n"); return -1; } - - /* allocate our sock slab cache */ - - cluster_sk_cachep = kmem_cache_create("cluster_sock", - - sizeof (struct cluster_sock), 0, - - SLAB_HWCACHE_ALIGN, 0, 0); - - if (!cluster_sk_cachep) { - - printk(KERN_CRIT - - "cluster_init: Cannot create cluster_sock SLAB cache\n"); - - sock_unregister(AF_CLUSTER); + if (sock_register(&cl_family_ops)) { + proto_unregister(&cl_proto); + printk(KERN_INFO "Unable to register cluster socket type\n"); return -1; } @@ -4234,7 +4236,7 @@ cnxman_ioctl32_exit(); #endif sock_unregister(AF_CLUSTER); - - kmem_cache_destroy(cluster_sk_cachep); + proto_unregister(&cl_proto); } module_init(cluster_init); -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) iQIVAwUBQmDNYFA6oBJjVJ+OAQJzyQ/+PjjPRmdqGKzpsms+96wTSzw5iaEsZHx4 9tZF6nbVBaCoygB9B0xkR0ra37DwZg+vWHOlzcS6HoHkiz0LveeXWb6Xu9bsTu2a /9pIFSXFAaiwJTCE7FEHamHgm7yf2SyVyL2BS+05UzvYsfoG9JTIX2b8gsBtfb5J qF5sZIqYrcrGn3wNLqxID+qgb1pKcgQfUGOWAVrdVy0xP2xClJQKSyFCsRcwCUmW 2qzIPW3DtBe996rlwVZAkupvHfueqGTkXNjhockah37+jO0KivcUA6ej2m+ZO1mk Rc2Q5mEvjsq5UHHFXO27BomLXNYXdge9HZ9cAvip4tGvlby2PA90R0txTECKUbFK jJCcfg9l0rS+OKGlCSEnyC52UIlU67lrvXiPvUFhyd0VMfVpaSFHe4NYJZbx0iQx AFRcxaCkSLpZU78b4NpSig+qLz4ynLYcyPRXxL+WZpqRrbjaGnPdjkkwaX9hPqzs cGLHMhgS8ImMZK6s67hutTIBXfgYZA7cdu9VzR+zITcssfuxowfCEMZOR/ixaD7+ jYSzS89NTHKhv0cAppu0JWNwC5vIKYu4WBxkRzTjjU8OqsozaSnvoDlQlyfn7Ffb kqbXeJopnMHY1NW8DyazNRtrdArlP/Jw+7gi00S7LVDRlOpboxG9g5NDXhzTzmdP goIHcBuTlWk= =Dfi6 -----END PGP SIGNATURE----- From fabbione at fabbione.net Sat Apr 16 18:35:34 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Sat, 16 Apr 2005 20:35:34 +0200 Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel (2.6.12rc2) In-Reply-To: <20050416083344.1BCE02B9F@trider-g7.fabbione.net> References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net> Message-ID: <42615AF6.20608@fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Fabio Massimo Di Nitto wrote: > Hi everybody, > > @@ -918,14 +918,14 @@ > return; > } > > -static struct sock *cl_alloc_sock(struct socket *sock, int gfp) > +static struct sock *cl_alloc_sock(struct socket *sock, int gfp, int protocol) > { > struct sock *sk; > struct cluster_sock *c; > > if ((sk = > - sk_alloc(AF_CLUSTER, gfp, sizeof (struct cluster_sock), > - cluster_sk_cachep)) == NULL) > + sk_alloc(AF_CLUSTER, gpf, &cl_proto, > + 1)) == NULL) > goto no_sock; > > if (sock) { Meh.. sorry.. i just realized that i did a typo in this hunk s/gpf/gfp. fabio - -- Self-Service law: The last available dish of the food you have decided to eat, will be inevitably taken from the person in front of you. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCYVr1hCzbekR3nhgRAqMSAJ90YRnv4frDEDyqBSQeJ5xm+1h8/wCfUbny Ed+7DiMfJ0SWD01pv1uCKKA= =h65P -----END PGP SIGNATURE----- From cjkovacs at verizon.net Sun Apr 17 11:01:35 2005 From: cjkovacs at verizon.net (Corey Kovacs) Date: Sun, 17 Apr 2005 07:01:35 -0400 Subject: [Linux-cluster] GFS 6.0.24 hangs machine.... Message-ID: <200504170701.36039.cjkovacs@verizon.net> Hello, I've got a 5 node GFS cluster (RHEL3u4, GFS 6.0.2-24, kernel 2.4.21-24.0.1) with 3 volumes, one of which is approx 500GB and contains several thousand small files. 
When I do a find on that volume, or slocate is run via its cron job, or I rsync that volume, the node used to access the volume gets into a state where it cannot fork anymore and nothing can be done with the machine until it is restarted (usually requiring a "fence_node" from another machine). The cluster is configured with 3 of the nodes acting as lock managers, using DL360's with 2GB ram each and qlogic 2342 dual port cards connected to an msa1000. The journals are not on their own volumes and the defaults are used for mounting. Is this a known problem? I've searched for other posts with this problem but have not had any luck with it. Any ideas as to what might be causing this? Thanks Corey From pcaulfie at redhat.com Mon Apr 18 07:56:23 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 18 Apr 2005 08:56:23 +0100 Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel (2.6.12rc2) In-Reply-To: <20050416083344.1BCE02B9F@trider-g7.fabbione.net> References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net> Message-ID: <20050418075623.GB6015@tykepenguin.com> On Sat, Apr 16, 2005 at 10:33:44AM +0200, Fabio Massimo Di Nitto wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi everybody, > > the 26th of March 2005 Arnaldo Carvalho de Melo commited a quite big change to > sk_alloc: Thanks. This change is in my tree. I'll commit it with the other 2.6.12pre2 stuff shortly. -- patrick From fabbione at fabbione.net Mon Apr 18 08:00:47 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Mon, 18 Apr 2005 10:00:47 +0200 Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel (2.6.12rc2) In-Reply-To: <20050418075623.GB6015@tykepenguin.com> References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net> <20050418075623.GB6015@tykepenguin.com> Message-ID: <4263692F.5010906@fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Patrick Caulfield wrote: > On Sat, Apr 16, 2005 at 10:33:44AM +0200, Fabio Massimo Di Nitto wrote: > >>-----BEGIN PGP SIGNED MESSAGE----- >>Hash: SHA1 >> >>Hi everybody, >> >>the 26th of March 2005 Arnaldo Carvalho de Melo commited a quite big change to >>sk_alloc: > > > Thanks. > > This change is in my tree. I'll commit it with the other 2.6.12pre2 stuff > shortly. > Cool, i can confirm that fix works fine on i386 and it builds fine (sorry but i can't test) on ppc/amd64/sparc64/ia64/hppa. Fabio PS is anybody actually building cluster/ with gcc-4.0? -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCY2kthCzbekR3nhgRAq2FAJ9sxnTu58c6/zC2VIylubnQvlpa5QCgpOGz wF+dVnTclMMKsTXvqAzeBbc= =aGHF -----END PGP SIGNATURE----- From ptr at poczta.fm Mon Apr 18 13:11:10 2005 From: ptr at poczta.fm (ptr at poczta.fm) Date: 18 Apr 2005 15:11:10 +0200 Subject: [Linux-cluster] Problems after upgrade :( Message-ID: <20050418131110.780E43B21F0@poczta.interia.pl> Hello. I installed the newest CVS version from scratch (because I noticed problems with old libs and binaries remaining in system directories even after "make install" with certain components).
Anyways, now I'm getting those entries as below after node startup and even during normal work: GFS: Trying to join cluster "lock_dlm", "cluster1:eva" scheduling while atomic: cman_comms/0x00000001/8808 [] schedule+0xbc2/0xbd0 [] __wake_up+0x3e/0x60 [] _spin_unlock_irqrestore+0xf/0x30 [] queue_message+0x109/0x120 [cman] [] add_barrier_callback+0x7d/0x160 [cman] [] callback_startdone_barrier_new+0x20/0x30 [cman] [] check_barrier_complete_phase2+0xc7/0x110 [cman] [] process_barrier_msg+0xa5/0x120 [cman] [] process_incoming_packet+0x18f/0x290 [cman] [] receive_message+0xd1/0xf0 [cman] [] cluster_kthread+0x18c/0x340 [cman] [] default_wake_function+0x0/0x20 [] cluster_kthread+0x0/0x340 [cman] [] kernel_thread_helper+0x5/0x10 scheduling while atomic: cman_comms/0x00000001/8808 [] schedule+0xbc2/0xbd0 [] start_ack_timer+0x2e/0x40 [cman] [] add_barrier_callback+0x7d/0x160 [cman] [] callback_startdone_barrier_new+0x20/0x30 [cman] [] check_barrier_complete_phase2+0xc7/0x110 [cman] [] process_barrier_msg+0xa5/0x120 [cman] [] process_incoming_packet+0x18f/0x290 [cman] [] receive_message+0xd1/0xf0 [cman] [] cluster_kthread+0x18c/0x340 [cman] [] default_wake_function+0x0/0x20 [] cluster_kthread+0x0/0x340 [cman] [] kernel_thread_helper+0x5/0x10 Besides, sometimes when I reboot one of the nodes (it's 2-nodes cluster running 2.6.11.7), it won't start up showing on console messages like "CMANsendmsg failed: "-101". I have to reboot again to start the node fully up. Any hints on what's wrong? TIA for your help, best regards Piotr ------------------------------------------------------------------ Teraz na tapecie mamy najwiekszego z silaczy. Sciagnij >> http://link.interia.pl/f1873 << From rajkum2002 at rediffmail.com Mon Apr 18 14:36:42 2005 From: rajkum2002 at rediffmail.com (Raj Kumar) Date: 18 Apr 2005 14:36:42 -0000 Subject: [Linux-cluster] Out of Memory Problem Message-ID: <20050418143642.26172.qmail@webmail47.rediffmail.com> Hi everyone, One of our GFS Linux servers has crashed twice yesterday. The log messages indicate the server ran out of memory and started killing processes: Out of Memory: Killed process 21188 (sshd). Out of Memory: Killed process 5215 (xfs). The server is a HP DL380 with dual Xeon 3.06 GHz processor, 1GB RAM, 2GB swap space running RHEL 3.0- kernel 2.4.21-27.0.1.ELsmp. The server runs NIS, GFS, SSHD and samba services. After the first crash the server didn?t start due to file system corruption. The problem has been corrected and server returned to operation yesterday evening. Today's log indicates the server ran out of memory and killed processes again this morning. This out of memory problem is recurring speciallyl when users are accessing the storage mounted using GFS. Where can I start to debug the problem? free -m output: total used free shared buffers cached Mem: 1001 986 14 0 1 79 -/+ buffers/cache: 905 95 Swap: 1996 49 1946 I don't understand what's happening to the total 1GB memory. This is the free output that happened seconds before crash (I had swatch set up to log the statistics the moment it sees OOM messages). PS output doesn't show any process taking significant portion of memory either. Since this is happening only when users are using GFS heavily I suspect it is the problem. But how do I verify it? Is 1GB too small for a GFS server? I found that another user has seen the same problem before: https://www.redhat.com/archives/linux-cluster/2005-January/msg00099.html GFS setup was fine and all our tests passed. 
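One place to start with this kind of lowmem exhaustion is the kernel slab caches, since on a 32-bit kernel they live in the Normal zone that the log shows running dry; a rough way to rank them, assuming the 2.4-style /proc/slabinfo columns (name, active objects, total objects, object size, ...):

# rank slab caches by approximate memory use (total objects * object size);
# the field numbers here are an assumption, check the header of your /proc/slabinfo
awk 'NR > 1 { printf "%-20s %10d KB\n", $1, $3 * $4 / 1024 }' /proc/slabinfo | sort -rn -k2 | head

Running that every few minutes while users are working on the GFS mount would show whether one particular cache keeps growing.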
We then moved it to production and it immediately failed after running for two days. Your help is very much appreciated!! The problem seems to be reproducible. So if you need any logs I can rerun what our users did at the time of crash. Thanks, Raj ================== Log ============================ Apr 7 10:49:15 server1 kernel: Mem-info: Apr 7 10:49:15 server1 kernel: Zone:DMA freepages: 2792 min: 0 low: 0 high: 0 Apr 7 10:49:15 server1 kernel: Zone:Normal freepages: 382 min: 766 low: 4031 high: 5791 Apr 7 10:49:15 server1 kernel: Zone:HighMem freepages: 287 min: 255 low: 510 high: 765 Apr 7 10:49:15 server1 kernel: Free pages: 3461 ( 287 HighMem) Apr 7 10:49:15 server1 kernel: ( Active: 22389/6071, inactive_laundry: 889, inactive_clean: 943, free: 3461 ) Apr 7 10:49:15 server1 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2792 Apr 7 10:49:15 server1 kernel: aa:6 ac:13 id:292 il:43 ic:0 fr:382 Apr 7 10:49:15 server1 kernel: aa:15159 ac:7211 id:5769 il:856 ic:943 fr:287 Apr 7 10:49:15 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr 7 10:49:15 server1 kernel: 0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr 7 10:49:15 server1 kernel: 33*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1148kB) Apr 7 10:49:15 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr 7 10:49:15 server1 kernel: 218499 pages of slabcache Apr 7 10:49:15 server1 kernel: 216 pages of kernel stacks Apr 7 10:49:16 server1 kernel: 0 lowmem pagetables, 489 highmem pagetables Apr 7 10:49:16 server1 kernel: Free swap: 2038872kB Apr 7 10:49:16 server1 kernel: 262138 pages of RAM Apr 7 10:49:16 server1 kernel: 32762 pages of HIGHMEM Apr 7 10:49:16 server1 kernel: 5780 reserved pages Apr 7 10:49:16 server1 kernel: 16752 pages shared Apr 7 10:49:16 server1 kernel: 485 pages swap cached Apr 7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd). Apr 7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd). 
Apr 7 10:49:20 server1 kernel: Mem-info: Apr 7 10:49:20 server1 kernel: Zone:DMA freepages: 2792 min: 0 low: 0 high: 0 Apr 7 10:49:20 server1 kernel: Zone:Normal freepages: 382 min: 766 low: 4031 high: 5791 Apr 7 10:49:20 server1 kernel: Zone:HighMem freepages: 291 min: 255 low: 510 high: 765 Apr 7 10:49:20 server1 kernel: Free pages: 3465 ( 291 HighMem) Apr 7 10:49:20 server1 kernel: ( Active: 21743/6636, inactive_laundry: 896, inactive_clean: 1049, free: 3465 ) Apr 7 10:49:20 server1 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2792 Apr 7 10:49:20 server1 kernel: aa:6 ac:36 id:265 il:40 ic:0 fr:382 Apr 7 10:49:20 server1 kernel: aa:14479 ac:7222 id:6365 il:862 ic:1049 fr:291 Apr 7 10:49:20 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr 7 10:49:20 server1 kernel: 28*4kB 1*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr 7 10:49:20 server1 kernel: 37*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1164kB) Apr 7 10:49:20 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr 7 10:49:21 server1 kernel: 218570 pages of slabcache Apr 7 10:49:21 server1 kernel: 196 pages of kernel stacks Apr 7 10:49:21 server1 kernel: 0 lowmem pagetables, 404 highmem pagetables Apr 7 10:49:21 server1 kernel: Free swap: 2038872kB Apr 7 10:49:21 server1 kernel: 262138 pages of RAM Apr 7 10:49:21 server1 kernel: 32762 pages of HIGHMEM Apr 7 10:49:21 server1 kernel: 5780 reserved pages Apr 7 10:49:22 server1 kernel: 13904 pages shared Apr 7 10:49:22 server1 kernel: 485 pages swap cached Apr 7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs). Apr 7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs). ......... ............ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pcaulfie at redhat.com Mon Apr 18 14:48:12 2005 From: pcaulfie at redhat.com (Patrick Caulfield) Date: Mon, 18 Apr 2005 15:48:12 +0100 Subject: [Linux-cluster] Problems after upgrade :( In-Reply-To: <20050418131110.780E43B21F0@poczta.interia.pl> References: <20050418131110.780E43B21F0@poczta.interia.pl> Message-ID: <20050418144812.GH6015@tykepenguin.com> On Mon, Apr 18, 2005 at 03:11:10PM +0200, ptr at poczta.fm wrote: > Hello. > > I installed the newest CVS version from the scratch > (because of noticed problems with old libs and binaries > remaining in system directories even after "make install" > with certain components. > Anyways, now I'm getting those entries as below after node startup > and even during normal work: Head of CVS is not a good thing to use. Checkout the RHEL4 branch instead. > > Besides, sometimes when I reboot one of the nodes > (it's 2-nodes cluster running 2.6.11.7), it won't start up > showing on console messages like "CMANsendmsg failed: "-101". > I have to reboot again to start the node fully up. > Any hints on what's wrong? > well, -101 is "Network is unreachable" so check that the network is correctly configure and full up before starting cman. -- patrick From mrc at linuxplatform.org Mon Apr 18 15:00:02 2005 From: mrc at linuxplatform.org (Matt) Date: Mon, 18 Apr 2005 11:00:02 -0400 Subject: [Linux-cluster] DB Clustering Question In-Reply-To: <1113571189.6839.6.camel@althea.playway.net> References: <1113571189.6839.6.camel@althea.playway.net> Message-ID: <1113836403.6865.34.camel@althea.playway.net> Thank you to everyone for the replies to my questions about clustering. 
I'll let you know what option we end up going with. -- Matt From rajkum2002 at rediffmail.com Mon Apr 18 17:04:38 2005 From: rajkum2002 at rediffmail.com (Raj Kumar) Date: 18 Apr 2005 17:04:38 -0000 Subject: [Linux-cluster] Out of Memory Problem Message-ID: <20050418170438.7970.qmail@webmail47.rediffmail.com> Hi everone, cat /proc/slabinfo: size-64 4825410 4825410 128 160847 160847 1 : 1008 252 This seems to be unusal... size-64 slab is consuming upto 643MB of RAM. This number seems to increase slowly... how to track which process is requesting the objects from this slab? Does anyone know if there is a bug related to this in RH 2.4.21-27.0.1.ELsmp kernel? Thank you, Raj ? On Mon, 18 Apr 2005 Raj Kumar wrote : >Hi everyone, > >One of our GFS Linux servers has crashed twice yesterday. The log messages indicate the server ran out of memory and started killing processes: > >Out of Memory: Killed process 21188 (sshd). >Out of Memory: Killed process 5215 (xfs). > >The server is a HP DL380 with dual Xeon 3.06 GHz processor, 1GB RAM, 2GB swap space running RHEL 3.0- kernel 2.4.21-27.0.1.ELsmp. The server runs NIS, GFS, SSHD and samba services. After the first crash the server didn?t start due to file system corruption. The problem has been corrected and server returned to operation yesterday evening. Today's log indicates the server ran out of memory and killed processes again this morning. This out of memory problem is recurring speciallyl when users are accessing the storage mounted using GFS. > >Where can I start to debug the problem? > >free -m output: > > total used free shared buffers cached >Mem: 1001 986 14 0 1 79 >-/+ buffers/cache: 905 95 >Swap: 1996 49 1946 > >I don't understand what's happening to the total 1GB memory. This is the free output that happened seconds before crash (I had swatch set up to log the statistics the moment it sees OOM messages). PS output doesn't show any process taking significant portion of memory either. Since this is happening only when users are using GFS heavily I suspect it is the problem. But how do I verify it? Is 1GB too small for a GFS server? > >I found that another user has seen the same problem before: >https://www.redhat.com/archives/linux-cluster/2005-January/msg00099.html > >GFS setup was fine and all our tests passed. We then moved it to production and it immediately failed after running for two days. Your help is very much appreciated!! The problem seems to be reproducible. So if you need any logs I can rerun what our users did at the time of crash. 
> >Thanks, >Raj > >================== Log ============================ > >Apr 7 10:49:15 server1 kernel: Mem-info: >Apr 7 10:49:15 server1 kernel: Zone:DMA freepages: 2792 min: 0 low: 0 high: 0 >Apr 7 10:49:15 server1 kernel: Zone:Normal freepages: 382 min: 766 low: 4031 high: 5791 >Apr 7 10:49:15 server1 kernel: Zone:HighMem freepages: 287 min: 255 low: 510 high: 765 >Apr 7 10:49:15 server1 kernel: Free pages: 3461 ( 287 HighMem) >Apr 7 10:49:15 server1 kernel: ( Active: 22389/6071, inactive_laundry: 889, inactive_clean: 943, free: 3461 ) >Apr 7 10:49:15 server1 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2792 >Apr 7 10:49:15 server1 kernel: aa:6 ac:13 id:292 il:43 ic:0 fr:382 >Apr 7 10:49:15 server1 kernel: aa:15159 ac:7211 id:5769 il:856 ic:943 fr:287 >Apr 7 10:49:15 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr 7 10:49:15 server1 kernel: 0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr 7 10:49:15 server1 kernel: 33*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1148kB) Apr 7 10:49:15 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr 7 10:49:15 server1 kernel: 218499 pages of slabcache Apr 7 10:49:15 server1 kernel: 216 pages of kernel stacks Apr 7 10:49:16 server1 kernel: 0 lowmem pagetables, 489 highmem pagetables >Apr 7 10:49:16 server1 kernel: Free swap: 2038872kB >Apr 7 10:49:16 server1 kernel: 262138 pages of RAM Apr 7 10:49:16 server1 kernel: 32762 pages of HIGHMEM Apr 7 10:49:16 server1 kernel: 5780 reserved pages Apr 7 10:49:16 server1 kernel: 16752 pages shared Apr 7 10:49:16 server1 kernel: 485 pages swap cached Apr 7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd). >Apr 7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd). 
>Apr 7 10:49:20 server1 kernel: Mem-info: >Apr 7 10:49:20 server1 kernel: Zone:DMA freepages: 2792 min: 0 low: 0 high: 0 >Apr 7 10:49:20 server1 kernel: Zone:Normal freepages: 382 min: 766 low: 4031 high: 5791 >Apr 7 10:49:20 server1 kernel: Zone:HighMem freepages: 291 min: 255 low: 510 high: 765 >Apr 7 10:49:20 server1 kernel: Free pages: 3465 ( 291 HighMem) >Apr 7 10:49:20 server1 kernel: ( Active: 21743/6636, inactive_laundry: 896, inactive_clean: 1049, free: 3465 ) >Apr 7 10:49:20 server1 kernel: aa:0 ac:0 id:0 il:0 ic:0 fr:2792 >Apr 7 10:49:20 server1 kernel: aa:6 ac:36 id:265 il:40 ic:0 fr:382 >Apr 7 10:49:20 server1 kernel: aa:14479 ac:7222 id:6365 il:862 ic:1049 fr:291 >Apr 7 10:49:20 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr 7 10:49:20 server1 kernel: 28*4kB 1*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr 7 10:49:20 server1 kernel: 37*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1164kB) Apr 7 10:49:20 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr 7 10:49:21 server1 kernel: 218570 pages of slabcache Apr 7 10:49:21 server1 kernel: 196 pages of kernel stacks Apr 7 10:49:21 server1 kernel: 0 lowmem pagetables, 404 highmem pagetables >Apr 7 10:49:21 server1 kernel: Free swap: 2038872kB >Apr 7 10:49:21 server1 kernel: 262138 pages of RAM Apr 7 10:49:21 server1 kernel: 32762 pages of HIGHMEM Apr 7 10:49:21 server1 kernel: 5780 reserved pages Apr 7 10:49:22 server1 kernel: 13904 pages shared Apr 7 10:49:22 server1 kernel: 485 pages swap cached Apr 7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs). >Apr 7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs). >......... >............ >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >http://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From CAugustine at overlandstorage.com Mon Apr 18 19:56:05 2005 From: CAugustine at overlandstorage.com (CAugustine at overlandstorage.com) Date: Mon, 18 Apr 2005 12:56:05 -0700 Subject: [Linux-cluster] fence_manual... Message-ID: Hi Everyone, I have a two-node cluster. I have built and installed the cluster sources from cluster_0406282100 snapshot. I can bring up both nodes successfully, however, seems like I have a brain split cluster in that each node thinks it is the only node. I ran the cman_tool on each node as follows: cman_tool join -c OVLCluster -2 -n nodename Furthermore, in the log messages I often see messages that require one of the nodes to be rebooted. Some times I see the message in both nodes' /var/log/messages files. In this case, I reboot the node that needs to be rebooted and run the "fence_ack_maual -s rebooted-nodename" on the other system after the reboot. The problem is that then I see the same messages again on the rebooted node's messages file. Seems like the cluster is in some kind of a loop wanting to reboot that node over and over again after the reboot. Can anyone tell me what is going on? Also, I am running the ccsd, cman_tool, fence_tool, clvmd, vgchange commands by hand. What version of the clustering software has nice scripts such as "cluster start"? 
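Until packaged init scripts turn up, the by-hand sequence above is easy to wrap in a small script; a sketch only, reusing the cluster name and cman options from this mail, with the node name lookup, the vgchange flags and the example mount point being assumptions:

#!/bin/sh
# rough "cluster start" for a two-node cluster, in the order the tools expect
ccsd                                            # cluster configuration daemon first
cman_tool join -c OVLCluster -2 -n `uname -n`   # join the cluster as this node
fence_tool join                                 # join the fence domain
clvmd                                           # clustered LVM daemon
vgchange -aly                                   # activate the clustered volume groups
mount -t gfs /dev/vg00/lvol0 /mnt/gfs           # example device and mount point

The matching stop script would run roughly the same steps in reverse (umount, vgchange -aln, stop clvmd, fence_tool leave, cman_tool leave).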
Thanks, Caroline ---------------------------------------------------------------------------------------------- See our award-winning line of tape and disk-based backup & recovery solutions at http://www.overlandstorage.com ---------------------------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmarzins at redhat.com Mon Apr 18 21:43:39 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Mon, 18 Apr 2005 16:43:39 -0500 Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible solution In-Reply-To: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de> References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de> Message-ID: <20050418214338.GC8789@phlogiston.msp.redhat.com> On Fri, Apr 15, 2005 at 04:24:21PM +0200, Hansjoerg.Maurer at dlr.de wrote: > Hi > > I found a solution for the problem descriped below, > but I am not sure if it is the right way. > > - importing the two gnbd's (wich point to the same device) from two servers > -> /dev/gnbd0 and /dev/gnbd1 on the client > > - creating a multipath device with something like this: > echo "0 167772160 multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0 > (251:0 ist the major:minor id of /dev/gnbd0) > > - mounting the created device > eg: > mount -t gfs /dev/mapper/dm0 /mnt/lvol0 > > If I do a write on /mnt/lvol0 the gnbd_server task on both gnbd_servers start (with a noticeable speedup) > > If one gnbd_server fails dm removes that path with the following log > kernel: device-mapper: dm-multipath: Failing path 251:0. > > I was able to add it again with > > dmsetup message dm0 0 reinstate_path 251:0 > > > I was able to deactivate a path manually with > > dmsetup message dm0 0 fail_path 251:0 > > But I can not unimport the underlying gnbd > > gnbd_import: ERROR cannot disconnect device #1 : Device or resource busy > > > Is there a way to remove a gnbd, which is bunndled in a dm-multipath device? > (might be necessary, if one gnbd server must be rebooted) > > How can I reimport an gnbd on the client in state disconnected? > (I had to manually start > gnbd_recvd -d 0 to do so) > > Is the descriped solution for gnbd multipath the right one? Um... It's a really ugly one. Unfortunately it's the only one that works, since multipath-tools do not currently support non-scsi devices. There are also some bugs in gnbd that make multipathing even more annoying. But to answer your question, in order to remove gnbd, you must first get it out of the multipath table, otherwise dm-multipath will still have it open. To do this, after dmsetup status shows that the path is failed, you run: # echo "0 167772160 multipath 0 0 1 1 round-robin 0 1 1 251:1 1000 " | dmsetup reload dm0 # dmsetup resume dm0 This removes the gnbd from the path. However, if you use the gnbd code from the cvs head, it is no longer necessary to do this to reimport the device. In the stable branch, gnbd_monitor waits until all users close the device before setting it to restartable. In the head code, this happens as soon as the device is successfully fenced. So, if you loose a gnbd server, reboot it, and reexport the device, gnbd_monitor should automatically reimport the device, and you can simply run # dmsetup message dm0 0 reinstate_path 251:0 and you should never need to remove the gnbd device with the method I described above. 
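Putting that together for the stable branch, with the device name, size and major:minor numbers taken from this thread (step 4, reloading the original two-path table once the server is back, is an assumption rather than something spelled out above):

# 1. confirm the path really is marked failed
dmsetup status dm0
# 2. load a table without the dead gnbd so dm-multipath lets go of it
echo "0 167772160 multipath 0 0 1 1 round-robin 0 1 1 251:1 1000" | dmsetup reload dm0
dmsetup resume dm0
# 3. the gnbd can now be unimported and the gnbd server rebooted / re-exported
# 4. after re-importing, load the original two-path table again
echo "0 167772160 multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000" | dmsetup reload dm0
dmsetup resume dm0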
-Ben > Thank you very much > > Greetings from munich > > Hansj?rg > > > > > > > >Hi > > > >I am trying to set up gnbd with multipath. > >Accoding to the gnbd_usage.txt file, I understand, that this should work with > >dm-multipath. > >But unfortunatly only the gfs part of the setup is descriped there. > > > >Has anybody experiance with this setup, especially how to set up > >multipath with multiple /dev/gnbd* and how to setup the multipath.conf file > > > > > >Thank you very much > > > >Hansj?rg Maurer > -- > _________________________________________________________________ > > Dr. Hansjoerg Maurer | LAN- & System-Manager > | > Deutsches Zentrum | DLR Oberpfaffenhofen > f. Luft- und Raumfahrt e.V. | > Institut f. Robotik | > Postfach 1116 | Muenchner Strasse 20 > 82230 Wessling | 82234 Wessling > Germany | > | > Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de > Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ > __________________________________________________________________ > > > There are 10 types of people in this world, > those who understand binary and those who don't. > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From bmarzins at redhat.com Mon Apr 18 21:59:24 2005 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Mon, 18 Apr 2005 16:59:24 -0500 Subject: [Linux-cluster] [PATCH] Fix gnbd-kernel build with 2.6.12rc2 In-Reply-To: <20050414144519.546922A8C@trider-g7.fabbione.net> References: <20050414144519.546922A8C@trider-g7.fabbione.net> Message-ID: <20050418215924.GD8789@phlogiston.msp.redhat.com> On Thu, Apr 14, 2005 at 04:45:19PM +0200, Fabio Massimo Di Nitto wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi everybody, > > the following patch fixes compilation of gnbd.c with 2.6.12rc2. > > The i_sock has been recently removed from the inode structure (change > happened in the kernel tree the 1st of April) and made part of i_mode. > > Please apply. Thanks, I'll get it in shortly. 
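For anyone chasing -rc kernels by hand, a quick way to see which variant a given tree wants before building gnbd-kernel (run from the top of the kernel source, which is an assumption about your layout):

# does this kernel still have the i_sock field, or should the module use S_ISSOCK()?
grep -n 'i_sock' include/linux/fs.h \
    && echo "old API: test inode->i_sock" \
    || echo "new API: test S_ISSOCK(inode->i_mode)"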
-Ben > Signed-off-by: Fabio Massimo Di Nitto > > Index: gnbd-kernel/src/gnbd.c > =================================================================== > RCS file: /cvs/cluster/cluster/gnbd-kernel/src/gnbd.c,v > retrieving revision 1.7 > diff -u -r1.7 gnbd.c > - --- gnbd-kernel/src/gnbd.c 7 Apr 2005 16:19:37 -0000 1.7 > +++ gnbd-kernel/src/gnbd.c 14 Apr 2005 14:30:29 -0000 > @@ -735,7 +735,7 @@ > if (!file) > return error; > inode = file->f_dentry->d_inode; > - - if (!inode->i_sock) { > + if (!S_ISSOCK(inode->i_mode)) { > fput(file); > return error; > } > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.2.5 (GNU/Linux) > > iQIVAwUBQl6BTVA6oBJjVJ+OAQIU7BAAnea52QS9ISXWHXWrrqeEaqFVbm1bSs1A > +BKMycDiSsDKwttb+/bma2V56gjdqnv7//11wv2IiG5lt1q1HebgVTM+ecPMCRBb > 6VsJV2NB+HgjRtcNkbAiw7hLVpcG+WFe5VaFVSsG20B5I47n9ahkF0a8umY4zSbd > O1pCBJA3H4QMiTwlNA8kEj5EBdc3/jB4KCYGwGNhR7m61etZ4JMiEdGlOeQwYMK1 > 4DcXpCgo8aBLACUHGST2e3mnq48ztHHMNI7M0H8BLNrUbhm1EtIEtzyXqJjrS7ku > TNZKKyfjlioAJk4B718ValMMEifZtlxwjlT3FEYfEd7/MUA2sw6ET4arFbDKcGjU > Bn5wdFdoVDZpDwhWICfQq2rVleBydNGCyZ4HYMcI3WBi3RKH21zrLnt5YqL9EA/9 > 9TC8PhD24i8+9rp/kmRV3QtWJtooEO2VSfGKJSDXHoeKkt8S2RTByxuBo5UpBMkI > z/+lB8zlDyF+qvn3TtkaTuJC8fk3clrkQfT+jiI4/7ZztK37NgcCF9Qe1rac3QS4 > VFRTrYJD8hcAOMa40HHCdZTyezetE4N/m6SDOJ+Pps+2KTWYxkJguas0+Aua5yeP > jyyAV3vmKMmPewbNknw1gHoPTI4pz1QUZ89E3hhnmM1Zoi6y4CMzq1ndv/ZqAROx > cS4j9lsnd60= > =+YaG > -----END PGP SIGNATURE----- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > http://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Tue Apr 19 02:35:46 2005 From: teigland at redhat.com (David Teigland) Date: Tue, 19 Apr 2005 10:35:46 +0800 Subject: [Linux-cluster] fence_manual... In-Reply-To: References: Message-ID: <20050419023546.GA6559@redhat.com> On Mon, Apr 18, 2005 at 12:56:05PM -0700, CAugustine at overlandstorage.com wrote: > I have a two-node cluster. I have built and installed the cluster > sources from cluster_0406282100 snapshot. Check out code from the RHEL4 branch of cvs and I think you'll have much better luck. -- Dave Teigland From Hansjoerg.Maurer at dlr.de Tue Apr 19 06:37:54 2005 From: Hansjoerg.Maurer at dlr.de (Hansjoerg Maurer) Date: Tue, 19 Apr 2005 08:37:54 +0200 Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible solution In-Reply-To: <20050418214338.GC8789@phlogiston.msp.redhat.com> References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de> <20050418214338.GC8789@phlogiston.msp.redhat.com> Message-ID: <4264A742.5020508@dlr.de> Hi thank you for your reply, I have been doing some testing during the weekend and found a better solution In my e-mail from last week I had the following setup - SAN (sda1+sdb1) - 2 Nodes directly attached which form a LVM Stripe set aut of sda1 and sdb1 and export it (the created lvm) via gnbd each - Nodes in the LAN which import the two gnbd's and form a multipath-dm target with round-robin policy It works, but I found a solution wich looks much better. 
- SAN (sda1+sdb1) - 2 Nodes directly attached which export sda1+sdb1 via gnbd each (sda1 and sdb1 form a striped lvm) - Nodes in the LAN which gnbd-import sda1+sdb1 from each node -> noda_sda1 as gnbd0 -> noda_sdb1 as gnbd1 -> nodb_sda1 as gnbd2 -> nodb_sdb1 as gnbd3 - now I created a failover multipath configuration echo "0 85385412 multipath 0 0 2 1 round-robin 0 1 1 251:0 1000 round-robin 0 1 1 251:2 1000" | dmsetup create dma echo "0 85385412 multipath 0 0 2 1 round-robin 0 1 1 251:3 1000 round-robin 0 1 1 251:1 1000" | dmsetup create dmb In this configuration traffic to sda1 goes primarily to nodea and traffic to sdb1 primarily to nodeb. I adapted lvm.conf not to include /dev/gnbd in the search for volume groups, but /dev/mapper/dm instead (I get rid of the duplicate volume group with this workaround). After I start clvmd, I can see the volume on the client. With this solution, I have a speedup of about 50% compared to example one (I think because the striping is done by the client, whereas in example one the client performs round-robin load-balancing across different paths and the gnbd server stripes on both disks...) With dmsetup message dma 0 disable_group 1 dmsetup message dmb 0 disable_group 2 dmsetup message dma 0 enable_group 1 dmsetup message dmb 0 enable_group 2 I can switch between the two paths. It will be a bit of work to get the startup scripts working correctly, because the dmsetup multipath command depends on the major and minor device IDs of the gnbd devices on the client, which seem not to be persistent. It will take some scripting to abstract that away... :-) I will post it if I have a solution... The most annoying point for me at the moment is the difference between gnbd read and write performance. Therefore I am glad that you as a gnbd developer answered... In my tests, gnbd write is about two to three times faster than gnbd read. I tried a lot of things (exporting cached, changing readahead with the blockdev command on the underlying device, changing TCP/IP buffer sizes) but I had no improvement. In the example above, I get a write speed of about 85 MB/s over gnbd and a read speed of about 26 MB/s (the underlying devices sda and sdb manage about 50 MB/s read and write). Therefore the write speed is very good... First I thought it might be related to the strange dm setup I was running, and therefore I tried it with gnbd-exporting and importing just a single block device (without lvm and dm), but the problem remains... Have I misconfigured something completely (I am using GBEth bonding devices) or can you or anybody else confirm the behavior of much better write than read performance? I was testing with RHEL4 2.6.9-6.38.EL Thank you for your help and your great work... Greetings from a rainy morning in Munich Hansjörg Benjamin Marzinski wrote: >On Fri, Apr 15, 2005 at 04:24:21PM +0200, Hansjoerg.Maurer at dlr.de wrote: > > >>Hi >> >>I found a solution for the problem descriped below, >>but I am not sure if it is the right way.
>> >>- importing the two gnbd's (wich point to the same device) from two servers >>-> /dev/gnbd0 and /dev/gnbd1 on the client >> >>- creating a multipath device with something like this: >>echo "0 167772160 multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0 >> (251:0 ist the major:minor id of /dev/gnbd0) >> >>- mounting the created device >>eg: >>mount -t gfs /dev/mapper/dm0 /mnt/lvol0 >> >>If I do a write on /mnt/lvol0 the gnbd_server task on both gnbd_servers start (with a noticeable speedup) >> >>If one gnbd_server fails dm removes that path with the following log >>kernel: device-mapper: dm-multipath: Failing path 251:0. >> >>I was able to add it again with >> >>dmsetup message dm0 0 reinstate_path 251:0 >> >> >>I was able to deactivate a path manually with >> >>dmsetup message dm0 0 fail_path 251:0 >> >>But I can not unimport the underlying gnbd >> >>gnbd_import: ERROR cannot disconnect device #1 : Device or resource busy >> >> >>Is there a way to remove a gnbd, which is bunndled in a dm-multipath device? >>(might be necessary, if one gnbd server must be rebooted) >> >>How can I reimport an gnbd on the client in state disconnected? >>(I had to manually start >>gnbd_recvd -d 0 to do so) >> >>Is the descriped solution for gnbd multipath the right one? >> >> > >Um... It's a really ugly one. Unfortunately it's the only one that works, since >multipath-tools do not currently support non-scsi devices. > >There are also some bugs in gnbd that make multipathing even more annoying. > >But to answer your question, in order to remove gnbd, you must first get it >out of the multipath table, otherwise dm-multipath will still have it open. > >To do this, after dmsetup status shows that the path is failed, you run: > ># echo "0 167772160 multipath 0 0 1 1 round-robin 0 1 1 251:1 1000 " | dmsetup reload dm0 ># dmsetup resume dm0 > >This removes the gnbd from the path. > >However, if you use the gnbd code from the cvs head, it is no longer necessary >to do this to reimport the device. In the stable branch, gnbd_monitor waits >until all users close the device before setting it to restartable. In the head >code, this happens as soon as the device is successfully fenced. So, if you >loose a gnbd server, reboot it, and reexport the device, gnbd_monitor should >automatically reimport the device, and you can simply run > ># dmsetup message dm0 0 reinstate_path 251:0 > >and you should never need to remove the gnbd device with the method I described >above. > > >-Ben > > > >>Thank you very much >> >>Greetings from munich >> >>Hansj?rg >> >> >> >> >> >> >> >> >>>Hi >>> >>>I am trying to set up gnbd with multipath. >>>Accoding to the gnbd_usage.txt file, I understand, that this should work with >>>dm-multipath. >>>But unfortunatly only the gfs part of the setup is descriped there. >>> >>>Has anybody experiance with this setup, especially how to set up >>>multipath with multiple /dev/gnbd* and how to setup the multipath.conf file >>> >>> >>>Thank you very much >>> >>>Hansj?rg Maurer >>> >>> >>-- >>_________________________________________________________________ >> >>Dr. Hansjoerg Maurer | LAN- & System-Manager >> | >>Deutsches Zentrum | DLR Oberpfaffenhofen >> f. Luft- und Raumfahrt e.V. | >>Institut f. 
Robotik | >>Postfach 1116 | Muenchner Strasse 20 >>82230 Wessling | 82234 Wessling >>Germany | >> | >>Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de >>Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ >>__________________________________________________________________ >> >> >>There are 10 types of people in this world, >>those who understand binary and those who don't. >> >> >> >> >>-- >>Linux-cluster mailing list >>Linux-cluster at redhat.com >>http://www.redhat.com/mailman/listinfo/linux-cluster >> >> > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >http://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- _________________________________________________________________ Dr. Hansjoerg Maurer | LAN- & System-Manager | Deutsches Zentrum | DLR Oberpfaffenhofen f. Luft- und Raumfahrt e.V. | Institut f. Robotik | Postfach 1116 | Muenchner Strasse 20 82230 Wessling | 82234 Wessling Germany | | Tel: 08153/28-2431 | E-mail: Hansjoerg.Maurer at dlr.de Fax: 08153/28-1134 | WWW: http://www.robotic.dlr.de/ __________________________________________________________________ There are 10 types of people in this world, those who understand binary and those who don't. From fabbione at fabbione.net Tue Apr 19 07:37:18 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Tue, 19 Apr 2005 09:37:18 +0200 (CEST) Subject: [Linux-cluster] [PATCH] (cosmetic) do not configure cmirror Message-ID: <20050419073718.6CB662D55@trider-g7.fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi everybody, just a minor change to configure. cmirror has been commented out from all the targets in the toplevel Makefile, but it is still configured. Skip cmirror configuration, since it is not built anymore. Patch against CVS HEAD 2005-04-19 07:13 UTC. Signed-off-by: Fabio M. Di Nitto Thanks Fabio - --- configure 17 Nov 2004 04:29:09 -0000 1.4 +++ configure 19 Apr 2005 07:29:42 -0000 @@ -45,5 +45,5 @@ echo "configure rgmanager" (cd rgmanager; ./configure $@) - -echo "configure cmirror" - -(cd cmirror; ./configure $@) +#echo "configure cmirror" +#(cd cmirror; ./configure $@) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) iQIVAwUBQmS0s1A6oBJjVJ+OAQLAQxAAuYRnxOrDGcQQphhNbrcmu123bnodeGhI 9dKLaq0rgXjIQ2klhkKd491/waLQbQmoyVHJFYJeCgGQ9HeiVQB11giSZ6eDISWb +AtmBarkYZCIq8DqyHJ/eNSISVr7H6YF91PGEZNgV5oJ8uFmzNz8fhJ/fRyGbJrX BXDqF6+HpwGr0vVlThV8dEXFsStT8Nh9HPYD4IyBsOrwJ0Tl2H9a5GY4EjKyGqTJ WGIIQz2wjizAKy5J8S2uKIrXiDpHN0MprLUa7lqWswIo22/OE03tnF1VqC8y8/4T 3F+IE66/YHBJ+m4G5qWc3qCZGyGJnWKtH24dFENg/TxNrqjB2o0Srbi9tOCy/FYb dEfd3eAVdG8Jpyg02ayRi3aaHQW2/7JO6ELEAVKxUapxNUnfq7c4JoTxIza1Q/gp SDMUf93EBWe123/xJHPBMOzVDPu/dQF1GP5P8FbOR/xfS1jk1YvM1/cmyubzLvyd t1XPQtjSAM+eqxkO+rnjs6vngi0RlezuW08ET3WNWX5JMZgzjxwyRGZ/Q28gK+7a 98cOCGwxYkOjtZtQJeyhS4GNrCpHOeT0ok4KVbcY8w2DUPwz+m7+1U23n/IYZNSC 7SDbfIdbnDoxzn05gvKcx45c4V/yOdHa5wY+EIg+mHAVoF6AunFH8iss3u8N/JrE K7mIbYdvYCk= =NMFk -----END PGP SIGNATURE----- From fabbione at fabbione.net Tue Apr 19 07:41:01 2005 From: fabbione at fabbione.net (Fabio Massimo Di Nitto) Date: Tue, 19 Apr 2005 09:41:01 +0200 (CEST) Subject: [Linux-cluster] [PATCH] (cosmetic) propagate distclean to gfs_fsck Message-ID: <20050419074101.0BE3E2D55@trider-g7.fabbione.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi again, just another cosmetic change. gfs needs to propagate distclean at least to gfs_fsck otherwise there is a bunch of clutter left around :) Patch against CVS HEAD 2005-04-19 07:13 UTC. Signed-off-by: Fabio M. 
Di Nitto

Thanks

Fabio

- --- gfs/Makefile	31 Mar 2005 05:15:50 -0000	1.4
+++ gfs/Makefile	19 Apr 2005 07:37:44 -0000
@@ -43,6 +43,7 @@
 	cd gfs_tool && ${MAKE} clean
 
 distclean: clean
+	cd gfs_fsck && ${MAKE} distclean
 	rm -f make/defines.mk
 
 install:
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQmS2C1A6oBJjVJ+OAQIeXQ//ZyQOoSV2MN4H/RpgphppqfTKW/nNAFht
0/54Pi/E8JdAppUrxA/sWg93cyr1Qjzsv5eAYtNT/3tuwMUSm1aoJt3tkiCCTP69
yWLf/cu8/FIQKSHHZ6vhGzuTmGb3qYpCZxdLndsctuSZehR64UGQQYYjCc2pwgYV
5c+w07N8Zrl8MTUCYbcTQLDC5IaIwHaPHJDPIClLRte2u4u2rxo0ij1UtQsdgqrG
6VakdEdWMjCdNgzcZpcsmoAmICygQ4sZZkdVYdjKI+8gGts20OS+n4aRMJZ56k2C
wqI1gRT0ujK7+mtAt866zcd1lsE0V/rQP8CrXSEl2y0sqfY8RLCSdLpEBMji3f+5
wHmeJxB7rW/NM936BtFBpd15ymhwdnfDIVJCVl/r29GPM390FwGzLL0r05p4wsIo
yCUGojKPjq7KJy7Aq1z+CzIxHu2OFwJgI0Qceay4XuvgcF3WYgwKaS8sQ/e7AonJ
6UbtMqlo9JdHHyiMlIqdKnfI+arvU3ftrF59AybG1T+J3/jh2Dlh8bJ8xzXeQAEO
O1eRgbRHoaoWw57NNfpFBA0N1bsjBU60nG3Cj8yhDxYxlPCWTw0NXT+9gZdX5H/3
eiTfrFs9+rX96CSNdzG1cIagPY8FeJqNxvL+Pf7W2LshH9A7iKnnzjwbTRDEzA8A
jIFgml+0eNY=
=Niv8
-----END PGP SIGNATURE-----

From birger at birger.sh Tue Apr 19 11:32:41 2005
From: birger at birger.sh (birger)
Date: Tue, 19 Apr 2005 13:32:41 +0200
Subject: [Linux-cluster] clusterfs.sh: misleading description of parameter 'options'
Message-ID: <4264EC59.8040009@birger.sh>

The description of the parameter 'options' in clusterfs.sh talks about doing a file system check... It seems to be usable for setting any kind of mount option, so the description is misleading. Same for fs.sh. netfs.sh is ok.

--
birger

From birger at birger.sh Tue Apr 19 13:08:18 2005
From: birger at birger.sh (birger)
Date: Tue, 19 Apr 2005 15:08:18 +0200
Subject: [Linux-cluster] How to set up NFS HA service
Message-ID: <426502C2.7030803@birger.sh>

Debugging a cluster setup with this software could have been easier given better error messages from the components, but I'm getting there...

I thought I'd just mount my gfs file systems outside the resource manager's control to have them present all the time, and just use the resource manager to move over the IP address and do the NFS magic. That seems impossible, as I couldn't get any exports to happen when I defined them in cluster.conf without a surrounding . I could define the exports in /etc/exports, but then I would have to keep the files in sync. So in the end I put all my gfs file systems into cluster.conf.

It almost works. I get mounts, and they get exported. But I have some error messages in the log file, and the exports take a loooong time. Only 2 of the 3 exports defined seem to show up. I'm also a bit puzzled about why the file systems don't get unmounted when I disable all services.

As for file locking: I copied /etc/init.d/nfslock to /etc/init.d/nfslock-svc and made some changes. First, I added a little code to enable nfslock to read a variable STATD_STATEDIR for the -p option from the config file in /etc/sysconfig. I think this should get propagated back to upcoming Fedora releases if someone who knows how would bother to do it...

I then changed nfslock-svc to read a different config file (/etc/sysconfig/nfs-svc) and to do 'service nfslock stop' at the top of the start section and 'service nfslock start' at the bottom of the stop section. This enables me to have statd running as e.g. 'server1' on the cluster node until it takes over the nfs service. At takeover, statd gets restarted with its state directory on a cluster file system (so it can take over lock info belonging to the service) and with the name of the NFS service IP address.
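[Editor's note: a rough sketch of the kind of wrapper script described above, not the actual one. The config file /etc/sysconfig/nfs-svc, the variables STATD_STATEDIR and STATD_HOSTNAME, and the rpc.statd flags shown (-P for the state directory, -n for the name statd presents to clients) are illustrative assumptions; flag names vary between nfs-utils versions, so check your rpc.statd options before copying any of this.]

#!/bin/sh
# Sketch of a hypothetical /etc/init.d/nfslock-svc: run statd for the
# failover NFS service with its state directory on shared storage.

. /etc/sysconfig/nfs-svc   # e.g. STATD_STATEDIR=/mnt/gfs/statd
                           #      STATD_HOSTNAME=nfs-service-name

case "$1" in
  start)
        # stop the node-local statd first, as described above
        service nfslock stop
        # restart statd under the service identity, with shared state
        rpc.statd -n "$STATD_HOSTNAME" -P "$STATD_STATEDIR"
        ;;
  stop)
        killall rpc.statd
        # bring the node's own statd back once the service has moved away
        service nfslock start
        ;;
esac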
Does this sound reasonable? I know I'll lose any locks the cluster node may have had (as NFS client) when it takes over the nfs service, but I cannot see any reason why the cluster node should have nfs locks (or nfs mounts, for that matter) except when doing admin work. I think I could fix it by copying /var/lib/nfs/statd/sm* into the clustered file system right after the 'service nfslock stop' I put in.

I have appended part of my messages file and my cluster.conf file. Any help with my NFS export issues will be appreciated.

--
birger

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: text/xml
Size: 2950 bytes
Desc: not available
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages
URL:

From srua at plus.net Tue Apr 19 13:30:03 2005
From: srua at plus.net (Sergio Rua)
Date: Tue, 19 Apr 2005 14:30:03 +0100
Subject: [Linux-cluster] getting started
Message-ID: <426507DB.2080702@plus.net>

Hi,

I'm getting started with GFS but I cannot find documentation on how to get a cluster running. Is there anything I could read?

Thanks.

--
Sergio Rua

From Birger.Wathne at ift.uib.no Mon Apr 18 07:19:29 2005
From: Birger.Wathne at ift.uib.no (Birger Wathne)
Date: Mon, 18 Apr 2005 09:19:29 +0200
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
In-Reply-To: <1113577167.20618.155.camel@ayanami.boston.redhat.com>
References: <425E2180.6060609@birger.sh> <1113577167.20618.155.camel@ayanami.boston.redhat.com>
Message-ID: <42635F81.60801@uib.no>

Lon Hohberger wrote:
> On Thu, 2005-04-14 at 09:53 +0200, birger wrote:
. . .
> Fencing is required in order for CMAN to operate in any useful capacity
> in 2-node mode.

I currently use manual fencing, as the other node in the cluster doesn't exist yet... :-)

> Anyway, to make this short: You probably want fencing for your solution.
. . .
> With GFS, the file locking should just kind of "work", but the client
> would be required to fail over. I don't think the Linux NFS client can
> do this, but I believe the Solaris one can... (correct me if I'm wrong
> here).

When I worked with Sun clients, they could select between alternative servers at mount time, but not fail over from one server to another if the server became unavailable.

> With a pure NFS failover solution (ex: on ext3, w/o replicated cluster
> locks), there needs to be some changes to nfsd, lockd, and rpc.statd in
> order to make lock failover work seamlessly.

I once did this on a Sun system by stopping statd and merging the contents of /etc/sm* from the failing node into the takeover node, and then restarting. This seemed to have statd/lockd recheck the locks with the clients. I hoped something similar could be done on Linux.

> You can use rgmanager to do the IP and Samba failover. Take a look at
> "rgmanager/src/daemons/tests/*.conf". I don't know how well Samba
> failover has been tested.

This was a big help! The only documentation I found when searching for rgmanager on the net used instead of . No wonder I couldn't get my services up!

I now have my NFS service starting with this block in cluster.conf: