From phillips at redhat.com  Fri Apr  1 00:15:48 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 31 Mar 2005 19:15:48 -0500
Subject: [Linux-cluster] new dlm control/configuration
In-Reply-To: <20050331082707.GA27996@redhat.com>
References: <20050331071043.GD7190@redhat.com>
	<20050331074607.GL17350@marowsky-bree.de>
	<20050331082707.GA27996@redhat.com>
Message-ID: <200503311915.48154.phillips@redhat.com>

Hi Dave

On Thursday 31 March 2005 03:27, David Teigland wrote:
> ...the mechanism used to export the locking API to user space is
> pretty inconsequential.  We're doing reads/writes on a misc device at
> the moment (used through libdlm of course.)  Going through an fs
> might be better but I'm not sure why.

Please stick with the socket connection on the misc device.  It is 
efficient and simple.  If somebody wants to write a pseudo filesystem 
for it they can go through the socket.

Regards,

Daniel



From phillips at redhat.com  Fri Apr  1 00:21:02 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 31 Mar 2005 19:21:02 -0500
Subject: [Linux-cluster] new dlm control/configuration
In-Reply-To: <20050331071043.GD7190@redhat.com>
References: <20050331071043.GD7190@redhat.com>
Message-ID: <200503311921.02945.phillips@redhat.com>

Hi Dave,

On Thursday 31 March 2005 02:10, David Teigland wrote:
> A new command line program, dlm_tool, can be used to set up the dlm
> manually in which case it depends on no other software (much like
> using dmsetup with device-mapper.)

Then what would be wrong with calling it dlmsetup?

Regards,

Daniel



From phillips at redhat.com  Fri Apr  1 00:31:45 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 31 Mar 2005 19:31:45 -0500
Subject: [Linux-cluster] new dlm control/configuration
In-Reply-To: <20050331071043.GD7190@redhat.com>
References: <20050331071043.GD7190@redhat.com>
Message-ID: <200503311931.45491.phillips@redhat.com>

Hi Dave,

On Thursday 31 March 2005 02:10, David Teigland wrote:
> dlm_tool to configure/control the dlm manually:
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/?cvsroot=clu
>ster

re: set_local  <nodeid> <ipaddr> [<weight>]

I know that it's possible to have multiple ip addresses on a given node 
and that the nodeid is not necessarily the hostname.  However, it would 
be very nice to default to this and only need to use the set_local 
command to specify something more exotic.

Regards,

Daniel



From phillips at redhat.com  Fri Apr  1 00:36:23 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 31 Mar 2005 19:36:23 -0500
Subject: [Linux-cluster] new dlm control/configuration
In-Reply-To: <20050331071043.GD7190@redhat.com>
References: <20050331071043.GD7190@redhat.com>
Message-ID: <200503311936.23400.phillips@redhat.com>

Hi Dave,

On Thursday 31 March 2005 02:10, David Teigland wrote:
> dlm_tool to configure/control the dlm manually:
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/dlm/?cvsroot=clu
>ster

re: get_done   <ls_name>

For the most part, the purpose of each of these commands is clear from 
its name, but not this one.  You could cure this by calling it 
"wait_result" or similar.

Regards,

Daniel



From phillips at redhat.com  Fri Apr  1 00:41:57 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 31 Mar 2005 19:41:57 -0500
Subject: [Linux-cluster] new dlm control/configuration
In-Reply-To: <20050331071043.GD7190@redhat.com>
References: <20050331071043.GD7190@redhat.com>
Message-ID: <200503311941.57259.phillips@redhat.com>

Hi Dave,

On Thursday 31 March 2005 02:10, David Teigland wrote:
> Hi Dave,

> 3. On each node we first need to tell the dlm what the local IP
> address and nodeid are:
>
> nodea> dlm_tool set_local 1 10.0.0.1
> nodeb> dlm_tool set_local 2 10.0.0.2
> nodec> dlm_tool set_local 3 10.0.0.3

But we have a sophisticated messaging system.  Why do we have to tell 
all the local IP addresses to each node?

Regards,

Daniel



From bojan at rexursive.com  Fri Apr  1 02:49:54 2005
From: bojan at rexursive.com (Bojan Smojver)
Date: Fri,  1 Apr 2005 12:49:54 +1000
Subject: [Linux-cluster] LOCK_USE_CLNT
Message-ID: <20050401124954.x9akojwy880sk888@imp.rexursive.com>

In the latest gfs-kernel tarball (gfs-kernel-2.6.9-28), there are still
references to this undefined symbol (apparently removed from 2.6.9-rc4, file
include/linux/fs.h of the kernel).

Is this supposed to exist somewhere? GFS kernel stuff doesn't like to be
compiled without it...

-- 
Bojan



From phillips at redhat.com  Fri Apr  1 05:48:01 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Fri, 1 Apr 2005 00:48:01 -0500
Subject: [Linux-cluster] DDraid benchmarks (epilogue)
In-Reply-To: <200503291016.49403.phillips@redhat.com>
References: <200503141717.19595.phillips@redhat.com>
	<200503291016.49403.phillips@redhat.com>
Message-ID: <200504010048.02097.phillips@redhat.com>

I looked into the cause of the ddraid oops noted in the earlier 
benchmark posting.  It turned out to be just the fact that nothing 
prevents the dm device from being removed while there is still deferred 
timer IO pending.  I filled out the missing benchmark table entries by 
just not removing the device.

the correct fix is probably to teach ddraid's destroy method to wait 
patiently until all the child events complete.  Alternatively, we could 
think about a higher level dm mechanism that understands how to wait 
for pending events other than just IO transfers before calling the 
destroy method.  Or I can just hack this into the destroy method for 
now and think about lifting it up into device mapper later.

Anyway, the missing numbers are for all overhead enabled on the ddraid 
order 2, in other words, the most interesting numbers.  The overheads 
in question are the parity calculations (calc) and the shared 
persistent dirty log (sync).  We see that in this case ddraid finishes 
the tar test dead even with IO to the raw disk.  But ddraid is doing 
more of course, it is running the dirty log, futzing with bio vectors 
and calculating parity on read and write.  So the dirty log is very 
efficient, even in the lots-of-small-transfers case.

In the nonfragmented IO case, ddraid does very well, as before.  Even 
with the dirty logging, ddraid order 2 is more than twice as fast as a 
single raw disk.

--------------------
untar linux-2.6.11.3
--------------------

raw scsi disk
process: real 48.994s user 45.526s sys 3.063s
umount:  real  3.084s user  0.002s sys 0.429s

ddraid order 1, no calc, no sync
process: real 49.942s user 46.328s sys 3.028s
umount:  real  2.034s user  0.005s sys 0.626s

ddraid order 1, calc, no sync
process: real 50.864s user 46.221s sys 3.195s
umount:  real  1.839s user  0.006s sys 1.099s

ddraid order 1, calc, sync
process: real 50.979s user 46.382s sys 3.222s
umount:  real  1.895s user  0.002s sys 0.531s

ddraid order 2, no calc, no sync
process: real 49.532s user 45.837s sys 3.145s
umount:  real  1.318s user  0.004s sys 0.718s

ddraid order 2, calc, no sync
process: real 49.742s user 45.527s sys 3.135s
umount:  real  1.625s user  0.004s sys 1.054s

ddraid order 2, no calc, sync
process: real 50.620s user 46.285s sys 3.122s
umount:  real  1.293s user  0.003s sys 1.103s

ddraid order 2, calc, sync
process: real 50.832s user 46.495s sys 3.084s
umount:  real  1.437s user  0.004s sys 0.787s

---------------------------------
cp /zoo/linux-2.6.11.3.tar.bz2 /x
---------------------------------

raw scsi disk
process: real 0.258s user 0.008s sys 0.236s
umount:  real 1.019s user 0.003s sys 0.032s

raw scsi disk (again)
process: real 0.264s user 0.013s sys 0.237s
umount:  real 1.053s user 0.005s sys 0.029s

raw scsi disk (again)
process: real 0.267s user 0.018s sys 0.233s
umount:  real 1.019s user 0.006s sys 0.028s

ddraid order 1, calc, no sync
process: real 0.267s user 0.007s sys 0.243s
umount:  real 0.568s user 0.006s sys 0.250s

ddraid order 1, no calc, sync
process: real 0.267s user 0.011s sys 0.240s
umount:  real 0.608s user 0.002s sys 0.032s

ddraid order 1, calc, sync
process: real 0.265s user 0.008s sys 0.239s
umount:  real 0.596s user 0.004s sys 0.042s

ddraid order 2, no calc, no sync
process: real 0.266s user 0.013s sys 0.234s
umount:  real 0.381s user 0.004s sys 0.049s

ddraid order 2, calc, no sync
process: real 0.269s user 0.010s sys 0.239s
umount:  real 0.392s user 0.004s sys 0.201s

ddraid order 2, no calc, sync
process: real 0.261s user 0.004s sys  0.244s
umount:  real 0.437s user 0.003s sys  0.195s

ddraid order 2, calc, sync
process: real 0.266s user 0.009s sys 0.240s
umount:  real 0.441s user 0.007s sys 0.026s



From patrick at tykepenguin.com  Fri Apr  1 10:51:44 2005
From: patrick at tykepenguin.com (Patrick Caulfield)
Date: Fri, 1 Apr 2005 11:51:44 +0100
Subject: [Linux-cluster] new dlm control/configuration
In-Reply-To: <20050331215036.GE1334@ca-server1.us.oracle.com>
References: <20050331071043.GD7190@redhat.com>
	<20050331074607.GL17350@marowsky-bree.de>
	<20050331082707.GA27996@redhat.com>
	<20050331083751.GB23452@tykepenguin.com>
	<20050331215036.GE1334@ca-server1.us.oracle.com>
Message-ID: <20050401105143.GA8720@tykepenguin.com>

On Thu, Mar 31, 2005 at 01:50:36PM -0800, Mark Fasheh wrote:
> Well it's actually quite clean in ocfs2_dlmfs, part of that is likely
> related to some design calls we made early on to simplify our userspace
> locking. We don't do ranges (anywhere really), and we consider all userspace
> lock requests to be synchronous. This does however result in a userspace API
> which is extremely lightweight and dirt simple to use.
> 
> mkdir gives you a new domain, files created within that directory correspond
> to lock resource with the same name. Open O_RDONLY gets you a PR mode lock,
> open RDWR gives you an EX mode lock. You can do NOQUEUE (trylock) ops with
> O_NONBLOCK. Reads and writes to the file return and set the LVB accordingly.
> 
> One can literally, create a domain, create locks within it and ship data via
> the LVB all from a bash shell on my cluster nodes.
> 
> I was able to write a trivial library wrapper (for those who don't want to
> use shell for controlling dlm functionality) in about 600 lines.
> 	--Mark
> 

That's interesting, thanks. As far as our DLM is concerned it's a very small
subset of the full functionality (so it would never replace the existing device
interface) but I can see it might be useful.

-- 

patrick



From bastian at waldi.eu.org  Fri Apr  1 14:20:31 2005
From: bastian at waldi.eu.org (Bastian Blank)
Date: Fri, 1 Apr 2005 16:20:31 +0200
Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device
	identifier
Message-ID: <20050401142031.GA21976@wavehammer.waldi.eu.org>

Hi folks

iddev currently only returns a human readable string of the device
content. This makes it rather unusable in contexts where you need to
have to do decitions on the content.

Changes:
- Add three enums which describe the content.
- identify_device get a struct which contains the enums and the human
  readable text.
- Get the block size. The struct also contains a field for the uuid, but
  it is not set yet.
- Add xfs support.
- Make the ext2/3 check differ between ext2 and ext3.

I currently try to do some cleanups in lvm2/fsadm. This tool currently
relies on an entry in /etc/fstab to get the filesystem type and a
temporary mount to get the block size. With this changes I can just use
iddev to gather the informations.

Bastian

-- 
Conquest is easy. Control is not.
		-- Kirk, "Mirror, Mirror", stardate unknown
-------------- next part --------------
Index: lib/iddev.h
===================================================================
--- lib/iddev.h	(revision 413)
+++ lib/iddev.h	(working copy)
@@ -16,19 +16,64 @@
 
 
 /**
+ * device_info - 
+ */
+
+enum device_info_family
+{
+	DEVICE_INFO_UNDEFINED_FAMILY = 0,
+	DEVICE_INFO_CONTAINER,
+	DEVICE_INFO_FILESYSTEM,
+	DEVICE_INFO_SWAP,
+};
+
+enum device_info_type
+{
+	DEVICE_INFO_UNDEFINED_TYPE = 0,
+	DEVICE_INFO_CONTAINER_CCA,
+	DEVICE_INFO_CONTAINER_CIDEV,
+	DEVICE_INFO_CONTAINER_LVM1,
+	DEVICE_INFO_CONTAINER_LVM2,
+	DEVICE_INFO_CONTAINER_PARTITION,
+	DEVICE_INFO_CONTAINER_POOL,
+	DEVICE_INFO_FILESYSTEM_EXT23,
+	DEVICE_INFO_FILESYSTEM_GFS,
+	DEVICE_INFO_FILESYSTEM_REISERFS,
+	DEVICE_INFO_FILESYSTEM_XFS,
+};
+
+enum device_info_subtype
+{
+	DEVICE_INFO_UNDEFINED_SUBTYPE = 0,
+	DEVICE_INFO_CONTAINER_PARTITION_MSDOS,
+	DEVICE_INFO_FILESYSTEM_EXT2,
+	DEVICE_INFO_FILESYSTEM_EXT3,
+};
+
+struct device_info
+{
+	enum device_info_family family;
+	enum device_info_type type;
+	enum device_info_subtype subtype;
+
+	char display[128];
+	unsigned char uuid[16];
+	size_t block_size;
+};
+
+/**
  * indentify_device - figure out what's on a device
  * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
+ * @info: a buffer
  *
  * The offset of @fd will be changed by the function.
  * This routine will not write to this device.
  *
  * Returns: -1 on error (with errno set), 1 if unabled to identify,
- *          0 if device identified (with @type set)
+ *          0 if device identified (with @info set)
  */
 
-int identify_device(int fd, char *type, unsigned type_len);
+int identify_device(int fd, struct device_info *info);
 
 
 /**
Index: lib/identify_device.c
===================================================================
--- lib/identify_device.c	(revision 413)
+++ lib/identify_device.c	(working copy)
@@ -50,7 +50,7 @@
 int main(int argc, char *argv[])
 {
   int fd;
-  char buf[BUFSIZE];
+  struct device_info info;
   uint64 bytes;
   int error;
 
@@ -63,18 +63,18 @@
   if (fd < 0)
     die("can't open %s: %s\n", argv[1], strerror(errno));
 
-  error = identify_device(fd, buf, BUFSIZE);
+  error = identify_device(fd, &info);
   if (error < 0)
     die("error identifying the contents of %s: %s\n", argv[1], strerror(errno));
   else if (error)
-    strcpy(buf, "unknown");
+    strcpy(info.display, "unknown");
 
   error = device_size(fd, &bytes);
   if (error < 0)
     die("error determining the size of %s: %s\n", argv[1], strerror(errno));
 
   printf("%s:\n%-15s%s\n%-15s%"PRIu64"\n",
-	 argv[1], "  contents:", buf, "  bytes:", bytes);
+	 argv[1], "  contents:", info.display, "  bytes:", bytes);
 
   close(fd);
 
Index: lib/iddev.c
===================================================================
--- lib/iddev.c	(revision 413)
+++ lib/iddev.c	(working copy)
@@ -25,24 +25,37 @@
 
 #include "iddev.h"
 
+static void info_set_display(struct device_info *info, const char *display)
+{
+  snprintf(info->display, sizeof (info->display), display);
+}
 
+static inline void info_set(struct device_info *info, const enum device_info_family family, const enum device_info_type type, const enum device_info_subtype subtype, const char *display)
+{
+  info->family = family;
+  info->type = type;
+  info->subtype = subtype;
+  info_set_display(info, display);
+}
 
+static inline void info_set_container(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display)
+{
+  info_set(info, DEVICE_INFO_CONTAINER, type, subtype, display);
+}
 
+static inline void info_set_filesystem(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display)
+{
+  info_set(info, DEVICE_INFO_FILESYSTEM, type, subtype, display);
+}
 
+typedef int check(int fd, struct device_info *info);
+
 /**
  * check_for_gfs - check to see if GFS is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * An EINVAL returned from lseek means that the device was too
- * small -- at least on Linux.
- *
- * Returns: -1 on error (with errno set), 1 if not GFS,
- *          0 if GFS found (with type set)
  */
 
-static int check_for_gfs(int fd, char *type, unsigned type_len)
+static check check_for_gfs;
+static int check_for_gfs(int fd, struct device_info *info)
 {
   unsigned char buf[512];
   uint32 *p = (uint32 *)buf;
@@ -66,7 +79,7 @@
   if (osi_be32_to_cpu(*p) != 0x01161970 || osi_be32_to_cpu(*(p + 1)) != 1)
     return 1;
 
-  snprintf(type, type_len, "GFS filesystem");
+  info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_GFS, 0, "GFS filesystem");
 
   return 0;
 }
@@ -74,15 +87,10 @@
 
 /**
  * check_for_pool - check to see if Pool is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not Pool,
- *          0 if Pool found (with type set)
  */
 
-static int check_for_pool(int fd, char *type, unsigned type_len)
+static check check_for_pool;
+static int check_for_pool(int fd, struct device_info *info)
 {
   unsigned char buf[512];
   uint64 *p = (uint64 *)buf;
@@ -106,23 +114,18 @@
   if (osi_be64_to_cpu(*p) != 0x11670)
     return 1;
 
-  snprintf(type, type_len, "Pool subdevice");
+  info_set_container(info, DEVICE_INFO_CONTAINER_POOL, 0, "Pool subdevice");
 
   return 0;
 }
 
 
 /**
- * check_for_paritition - check to see if Partition is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not Partition,
- *          0 if Partition found (with type set)
+ * check_for_msdos - check to see if Partition is on this device
  */
 
-static int check_for_partition(int fd, char *type, unsigned type_len)
+static check check_for_partition_msdos;
+static int check_for_partition_msdos(int fd, struct device_info *info)
 {
   unsigned char buf[512];
   int error;
@@ -145,29 +148,42 @@
   if (buf[510] != 0x55 || buf[511] != 0xAA)
     return 1;
 
-  snprintf(type, type_len, "partition information");
+  info_set_container(info, DEVICE_INFO_CONTAINER_PARTITION, DEVICE_INFO_CONTAINER_PARTITION_MSDOS, "MSDOS partition information");
 
   return 0;
 }
 
+enum
+{
+  BLOCK_SIZE_BITS = 10,
+  BLOCK_SIZE = (1 << BLOCK_SIZE_BITS),
+  EXT3_SUPER_MAGIC = 0xEF53,
+  EXT23_FEATURE_COMPAT_HAS_JOURNAL = 0x4,
+};
 
+struct ext23_superblock
+{
+  uint32_t _r1[6];                      /**< 0x00 - 0x14 */
+  uint32_t s_log_block_size;            /**< 0x18 */
+  uint32_t _r2[7];                      /**< 0x1c - 0x34 */
+  uint16_t s_magic;                     /**< 0x38 */
+  uint16_t s_state;                     /**< 0x3a */
+  uint32_t _r3[8];                      /**< 0x3c - 0x58 */
+  uint32_t s_feature_compat;            /**< 0x5c */
+  uint32_t s_feature_incompat;          /**< 0x60 */
+  uint32_t s_feature_ro_compat;         /**< 0x64 */
+  uint8_t s_uuid[16];                   /**< 0x68 - 0x77 */
+};
+
 /**
  * check_for_ext23 - check to see if EXT23 is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * An EINVAL returned from lseek means that the device was too
- * small -- at least on Linux.
- *
- * Returns: -1 on error (with errno set), 1 if not EXT23,
- *          0 if EXT23 found (with type set)
  */
 
-static int check_for_ext23(int fd, char *type, unsigned type_len)
+static check check_for_ext23;
+static int check_for_ext23(int fd, struct device_info *info)
 {
   unsigned char buf[512];
-  uint16 *p = (uint16 *)buf;
+  struct ext23_superblock *p = (struct ext23_superblock *)buf;
   int error;
 
   error = lseek(fd, 1024, SEEK_SET);
@@ -185,26 +201,78 @@
   else if (error < 58)
     return 1;
 
-  if (osi_le16_to_cpu(p[28]) != 0xEF53)
+  if (osi_le16_to_cpu(p->s_magic) != EXT3_SUPER_MAGIC)
     return 1;
 
-  snprintf(type, type_len, "EXT2/3 filesystem");
+  info->block_size = (BLOCK_SIZE << osi_le32_to_cpu(p->s_log_block_size));
 
+  if (osi_le16_to_cpu(p->s_feature_compat) & EXT23_FEATURE_COMPAT_HAS_JOURNAL)
+    info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT3, "EXT3 filesystem");
+  else
+    info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT2, "EXT2 filesystem");
+
   return 0;
 }
 
 
+enum
+{
+  XFS_SB_MAGIC = 0x58465342,
+};
+
+struct xfs_superblock
+{
+  uint32_t sb_magicnum;
+  uint32_t sb_blocksize;
+  uint64_t sb_dblocks;
+  uint64_t sb_rblocks;
+  uint64_t sb_rextents;
+  uint8_t sb_uuid[16];
+};
+
 /**
+ * check_for_xfs - check to see if XFS is on this device
+ */
+
+static check check_for_xfs;
+static int check_for_xfs(int fd, struct device_info *info)
+{
+  unsigned char buf[512];
+  struct xfs_superblock *p = (struct xfs_superblock *)buf;
+  int error;
+
+  error = lseek(fd, 0, SEEK_SET);
+  if (error < 0)
+    return (errno == EINVAL) ? 1 : error;
+  else if (error != 0)
+  {
+    errno = EINVAL;
+    return -1;
+  }
+
+  error = read(fd, buf, 512);
+  if (error < 0)
+    return error;
+  else if (error < 58)
+    return 1;
+
+  if (osi_be32_to_cpu(p->sb_magicnum) != XFS_SB_MAGIC)
+    return 1;
+
+  info->block_size = osi_be32_to_cpu(p->sb_blocksize);
+
+  info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_XFS, 0, "XFS filesystem");
+
+  return 0;
+}
+
+
+/**
  * check_for_swap - check to see if SWAP is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not SWAP,
- *          0 if SWAP found (with type set)
  */
 
-static int check_for_swap(int fd, char *type, unsigned type_len)
+static check check_for_swap;
+static int check_for_swap(int fd, struct device_info *info)
 {
   unsigned char buf[8192];
   int error;
@@ -227,7 +295,7 @@
   if (memcmp(buf + 4086, "SWAP-SPACE", 10) && memcmp(buf + 4086, "SWAPSPACE2", 10))
     return 1;
 
-  snprintf(type, type_len, "swap device");
+  info_set(info, DEVICE_INFO_SWAP, 0, 0, "swap device");
 
   return 0;
 }
@@ -235,15 +303,10 @@
 
 /**
  * check_for_lvm1 - check to see if LVM1 is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not LVM1,
- *          0 if LVM1 found (with type set)
  */
 
-static int check_for_lvm1(int fd, char *type, unsigned type_len)
+static check check_for_lvm1;
+static int check_for_lvm1(int fd, struct device_info *info)
 {
   unsigned char buf[512];
   int error;
@@ -266,7 +329,7 @@
   if (buf[0] != 'H' || buf[1] != 'M')
     return 1;
 
-  snprintf(type, type_len, "lvm1 subdevice");
+  info_set_container(info, DEVICE_INFO_CONTAINER_LVM1, 0, "LVM1 subdevice");
 
   return 0;
 }
@@ -274,15 +337,10 @@
 
 /**
  * check_for_lvm2 - check to see if LVM2 is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not LVM2,
- *          0 if LVM1 found (with type set)
  */
 
-static int check_for_lvm2(int fd, char *type, unsigned type_len)
+static check check_for_lvm2;
+static int check_for_lvm2(int fd, struct device_info *info)
 {
   unsigned char buf[512];
   int error;
@@ -315,7 +373,7 @@
     if (strncmp(&buf[24], "LVM2 001", 8) != 0)
       continue;
 
-    snprintf(type, type_len, "lvm2 subdevice");
+    info_set_container(info, DEVICE_INFO_CONTAINER_LVM2, 0, "LVM1 subdevice");
 
     return 0;
   }
@@ -326,15 +384,10 @@
 
 /**
  * check_for_cidev - check to see if CIDEV is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not CIDEV,
- *          0 if CIDEV found (with type set)
  */
 
-static int check_for_cidev(int fd, char *type, unsigned type_len)
+static check check_for_cidev;
+static int check_for_cidev(int fd, struct device_info *info)
 {
   unsigned char buf[512];
   uint32 *p = (uint32 *)buf;
@@ -358,7 +411,7 @@
   if (osi_be32_to_cpu(*p) != 0x47465341)
     return 1;
 
-  snprintf(type, type_len, "CIDEV");
+  info_set_container(info, DEVICE_INFO_CONTAINER_CIDEV, 0, "CIDEV");
 
   return 0;
 }
@@ -366,15 +419,10 @@
 
 /**
  * check_for_cca - check to see if CCA is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not CCA,
- *          0 if CCA found (with type set)
  */
 
-static int check_for_cca(int fd, char *type, unsigned type_len)
+static check check_for_cca;
+static int check_for_cca(int fd, struct device_info *info)
 {
   unsigned char buf[512];
   uint32 *p = (uint32 *)buf;
@@ -398,7 +446,7 @@
   if (osi_be32_to_cpu(*p) != 0x122473)
     return 1;
 
-  snprintf(type, type_len, "CCA device");
+  info_set_container(info, DEVICE_INFO_CONTAINER_CCA, 0, "CCA device");
 
   return 0;
 }
@@ -406,15 +454,10 @@
 
 /**
  * check_for_reiserfs - check to see if reisterfs is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not reiserfs,
- *          0 if CCA found (with type set)
  */
 
-static int check_for_reiserfs(int fd, char *type, unsigned type_len)
+static check check_for_reiserfs;
+static int check_for_reiserfs(int fd, struct device_info *info)
 {
   unsigned int pass;
   uint64 offset;
@@ -444,7 +487,7 @@
 	strncmp(buf + 52, "ReIsEr2Fs", 9) == 0 ||
 	strncmp(buf + 52, "ReIsEr3Fs", 9) == 0)
     {
-      snprintf(type, type_len, "Reiserfs filesystem");
+      info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_REISERFS, 0, "ReiserFS filesystem");
       return 0;
     }
   }
@@ -453,69 +496,40 @@
 }
 
 
-/**
- * identify_device - figure out what's on a device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * The offset of @fd will be changed by this function.
- * This routine will not write to the device.
- *
- * Returns: -1 on error (with errno set), 1 if unabled to identify,
- *          0 if device identified (with type set)
- */
+static check *checks[] =
+{
+  check_for_partition_msdos,
+  check_for_pool,
+  check_for_lvm1,
+  check_for_lvm2,
+  check_for_cidev,
+  check_for_cca,
+  check_for_ext23,
+  check_for_gfs,
+  check_for_reiserfs,
+  check_for_xfs,
+  check_for_swap,
+};
 
-int identify_device(int fd, char *type, unsigned type_len)
+int identify_device(int fd, struct device_info *info)
 {
-  int error;
+  int i;
 
-  if (!type || !type_len)
+  if (!info)
   {
     errno = EINVAL;
     return -1;
   }
 
-  error = check_for_pool(fd, type, type_len);
-  if (error <= 0)
-    return error;
+  memset(info, sizeof (struct device_info), 0);
 
-  error = check_for_lvm1(fd, type, type_len);
-  if (error <= 0)
-    return error;
+  for (i = 0; i < sizeof (checks) / sizeof (*checks); ++i)
+  {
+    int error = checks[i](fd, info);
+    if (error <= 0)
+      return error;
+  }
 
-  error = check_for_lvm2(fd, type, type_len);
-  if(error <= 0)
-    return error;
-
-  error = check_for_cidev(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_cca(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_gfs(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_ext23(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_reiserfs(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_swap(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_partition(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
   return 1;
 }
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050401/da52aee0/attachment.sig>

From agk at redhat.com  Fri Apr  1 15:10:01 2005
From: agk at redhat.com (Alasdair G Kergon)
Date: Fri, 1 Apr 2005 16:10:01 +0100
Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device
	identifier
In-Reply-To: <20050401142031.GA21976@wavehammer.waldi.eu.org>
References: <20050401142031.GA21976@wavehammer.waldi.eu.org>
Message-ID: <20050401151001.GB14307@agk.surrey.redhat.com>

On Fri, Apr 01, 2005 at 04:20:31PM +0200, Bastian Blank wrote:
> I currently try to do some cleanups in lvm2/fsadm. This tool currently
> relies on an entry in /etc/fstab to get the filesystem type and a
> temporary mount to get the block size. With this changes I can just use
> iddev to gather the informations.
 
Also, if online and offline resizers are both available, fsadm
should choose whichever is most appropriate according to whether
the filesystem is already mounted or not.

Alasdair
-- 
agk at redhat.com



From bastian at waldi.eu.org  Fri Apr  1 16:02:10 2005
From: bastian at waldi.eu.org (Bastian Blank)
Date: Fri, 1 Apr 2005 18:02:10 +0200
Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device
	identifier
In-Reply-To: <20050401151001.GB14307@agk.surrey.redhat.com>
References: <20050401142031.GA21976@wavehammer.waldi.eu.org>
	<20050401151001.GB14307@agk.surrey.redhat.com>
Message-ID: <20050401160210.GA22564@wavehammer.waldi.eu.org>

On Fri, Apr 01, 2005 at 04:10:01PM +0100, Alasdair G Kergon wrote:
> Also, if online and offline resizers are both available, fsadm
> should choose whichever is most appropriate according to whether
> the filesystem is already mounted or not.

Should be no real problem.

Hmm, the two weeks are over and I don't got a statement to my patch.

Bastian

-- 
She won' go Warp 7, Cap'n!  The batteries are dead!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050401/0257d732/attachment.sig>

From bastian at waldi.eu.org  Fri Apr  1 16:07:03 2005
From: bastian at waldi.eu.org (Bastian Blank)
Date: Fri, 1 Apr 2005 18:07:03 +0200
Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device
	identifier
In-Reply-To: <20050401142031.GA21976@wavehammer.waldi.eu.org>
References: <20050401142031.GA21976@wavehammer.waldi.eu.org>
Message-ID: <20050401160703.GB22564@wavehammer.waldi.eu.org>

Updated patch.

Changes:
- Use mmap.
- Add the device size to the exported information.

Bastian

-- 
A princess should not be afraid -- not with a brave knight to protect her.
		-- McCoy, "Shore Leave", stardate 3025.3
-------------- next part --------------
=== lib/iddev.h
==================================================================
--- lib/iddev.h   (/iddev/trunk)   (revision 29)
+++ lib/iddev.h   (/iddev/local/branches/refactor)   (revision 29)
@@ -16,19 +16,65 @@
 
 
 /**
+ * device_info - 
+ */
+
+enum device_info_family
+{
+	DEVICE_INFO_UNDEFINED_FAMILY = 0,
+	DEVICE_INFO_CONTAINER,
+	DEVICE_INFO_FILESYSTEM,
+	DEVICE_INFO_SWAP,
+};
+
+enum device_info_type
+{
+	DEVICE_INFO_UNDEFINED_TYPE = 0,
+	DEVICE_INFO_CONTAINER_CCA,
+	DEVICE_INFO_CONTAINER_CIDEV,
+	DEVICE_INFO_CONTAINER_LVM1,
+	DEVICE_INFO_CONTAINER_LVM2,
+	DEVICE_INFO_CONTAINER_PARTITION,
+	DEVICE_INFO_CONTAINER_POOL,
+	DEVICE_INFO_FILESYSTEM_EXT23,
+	DEVICE_INFO_FILESYSTEM_GFS,
+	DEVICE_INFO_FILESYSTEM_REISERFS,
+	DEVICE_INFO_FILESYSTEM_XFS,
+};
+
+enum device_info_subtype
+{
+	DEVICE_INFO_UNDEFINED_SUBTYPE = 0,
+	DEVICE_INFO_CONTAINER_PARTITION_MSDOS,
+	DEVICE_INFO_FILESYSTEM_EXT2,
+	DEVICE_INFO_FILESYSTEM_EXT3,
+};
+
+struct device_info
+{
+	enum device_info_family family;
+	enum device_info_type type;
+	enum device_info_subtype subtype;
+
+	char display[128];
+	unsigned char uuid[16];
+	uint64_t device_size;
+	uint32_t block_size;
+};
+
+/**
  * indentify_device - figure out what's on a device
  * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
+ * @info: a buffer
  *
  * The offset of @fd will be changed by the function.
  * This routine will not write to this device.
  *
  * Returns: -1 on error (with errno set), 1 if unabled to identify,
- *          0 if device identified (with @type set)
+ *          0 if device identified (with @info set)
  */
 
-int identify_device(int fd, char *type, unsigned type_len);
+int identify_device(int fd, struct device_info *info);
 
 
 /**
@@ -39,7 +85,7 @@
  * Returns: -1 on error (with errno set), 0 on success (with @bytes set)
  */
 
-int device_size(int fd, uint64 *bytes);
+int device_size(int fd, uint64_t *bytes);
 
 
 #endif /* __IDDEV_DOT_H__ */
=== lib/identify_device.c
==================================================================
--- lib/identify_device.c   (/iddev/trunk)   (revision 29)
+++ lib/identify_device.c   (/iddev/local/branches/refactor)   (revision 29)
@@ -50,8 +50,8 @@
 int main(int argc, char *argv[])
 {
   int fd;
-  char buf[BUFSIZE];
-  uint64 bytes;
+  struct device_info info;
+  const char *display;
   int error;
 
   prog_name = argv[0];
@@ -63,18 +63,16 @@
   if (fd < 0)
     die("can't open %s: %s\n", argv[1], strerror(errno));
 
-  error = identify_device(fd, buf, BUFSIZE);
+  error = identify_device(fd, &info);
   if (error < 0)
     die("error identifying the contents of %s: %s\n", argv[1], strerror(errno));
   else if (error)
-    strcpy(buf, "unknown");
+    display = "unknown";
+  else
+    display = info.display;
 
-  error = device_size(fd, &bytes);
-  if (error < 0)
-    die("error determining the size of %s: %s\n", argv[1], strerror(errno));
-
   printf("%s:\n%-15s%s\n%-15s%"PRIu64"\n",
-	 argv[1], "  contents:", buf, "  bytes:", bytes);
+	 argv[1], "  contents:", display, "  bytes:", info.device_size);
 
   close(fd);
 
=== lib/iddev.c
==================================================================
--- lib/iddev.c   (/iddev/trunk)   (revision 29)
+++ lib/iddev.c   (/iddev/local/branches/refactor)   (revision 29)
@@ -14,6 +14,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
@@ -25,48 +26,53 @@
 
 #include "iddev.h"
 
+static void info_set_display(struct device_info *info, const char *display)
+{
+  snprintf(info->display, sizeof (info->display), display);
+}
 
+static inline void info_set(struct device_info *info, const enum device_info_family family, const enum device_info_type type, const enum device_info_subtype subtype, const char *display)
+{
+  info->family = family;
+  info->type = type;
+  info->subtype = subtype;
+  info_set_display(info, display);
+}
 
+static inline void info_set_container(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display)
+{
+  info_set(info, DEVICE_INFO_CONTAINER, type, subtype, display);
+}
 
+static inline void info_set_filesystem(struct device_info *info, const enum device_info_type type, const enum device_info_subtype subtype, const char *display)
+{
+  info_set(info, DEVICE_INFO_FILESYSTEM, type, subtype, display);
+}
 
+typedef int check(const void *mem, size_t len, struct device_info *info);
+
 /**
  * check_for_gfs - check to see if GFS is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * An EINVAL returned from lseek means that the device was too
- * small -- at least on Linux.
- *
- * Returns: -1 on error (with errno set), 1 if not GFS,
- *          0 if GFS found (with type set)
  */
 
-static int check_for_gfs(int fd, char *type, unsigned type_len)
+enum
 {
-  unsigned char buf[512];
-  uint32 *p = (uint32 *)buf;
-  int error;
+  GFS_OFFSET = 64*1024,
+  GFS_SB_SIZE = 512,
+};
 
-  error = lseek(fd, 65536, SEEK_SET);
-  if (error < 0)
-    return (errno == EINVAL) ? 1 : error;
-  else if (error != 65536)
-  {
-    errno = EINVAL;
-    return -1;
-  }
+static check check_for_gfs;
+static int check_for_gfs(const void *mem, size_t len, struct device_info *info)
+{
+  const uint32_t *p = (const uint32_t *)((const unsigned char *)mem + GFS_OFFSET);
 
-  error = read(fd, buf, 512);
-  if (error < 0)
-    return error;
-  else if (error < 8)
+  if (len < GFS_OFFSET + GFS_SB_SIZE)
     return 1;
 
   if (osi_be32_to_cpu(*p) != 0x01161970 || osi_be32_to_cpu(*(p + 1)) != 1)
     return 1;
 
-  snprintf(type, type_len, "GFS filesystem");
+  info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_GFS, 0, "GFS filesystem");
 
   return 0;
 }
@@ -74,199 +80,186 @@
 
 /**
  * check_for_pool - check to see if Pool is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not Pool,
- *          0 if Pool found (with type set)
  */
 
-static int check_for_pool(int fd, char *type, unsigned type_len)
+enum
 {
-  unsigned char buf[512];
-  uint64 *p = (uint64 *)buf;
-  int error;
+  POOL_SB_SIZE = 512,
+};
 
-  error = lseek(fd, 0, SEEK_SET);
-  if (error < 0)
-    return error;
-  else if (error != 0)
-  {
-    errno = EINVAL;
-    return -1;
-  }
+static check check_for_pool;
+static int check_for_pool(const void *mem, size_t len, struct device_info *info)
+{
+  const uint64_t *p = mem;
 
-  error = read(fd, buf, 512);
-  if (error < 0)
-    return error;
-  else if (error < 8)
+  if (len < POOL_SB_SIZE)
     return 1;
 
   if (osi_be64_to_cpu(*p) != 0x11670)
     return 1;
 
-  snprintf(type, type_len, "Pool subdevice");
+  info_set_container(info, DEVICE_INFO_CONTAINER_POOL, 0, "Pool subdevice");
 
   return 0;
 }
 
 
 /**
- * check_for_paritition - check to see if Partition is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not Partition,
- *          0 if Partition found (with type set)
+ * check_for_partition_msdos - check to see if Partition is on this device
  */
 
-static int check_for_partition(int fd, char *type, unsigned type_len)
+enum
 {
-  unsigned char buf[512];
-  int error;
+  PARTITION_MSDOS_SB_SIZE = 512,
+};
 
-  error = lseek(fd, 0, SEEK_SET);
-  if (error < 0)
-    return error;
-  else if (error != 0)
-  {
-    errno = EINVAL;
-    return -1;
-  }
+static check check_for_partition_msdos;
+static int check_for_partition_msdos(const void *mem, size_t len, struct device_info *info)
+{
+  const unsigned char *buf = mem;
 
-  error = read(fd, buf, 512);
-  if (error < 0)
-    return error;
-  else if (error < 512)
+  if (len < PARTITION_MSDOS_SB_SIZE)
     return 1;
 
   if (buf[510] != 0x55 || buf[511] != 0xAA)
     return 1;
 
-  snprintf(type, type_len, "partition information");
+  info_set_container(info, DEVICE_INFO_CONTAINER_PARTITION, DEVICE_INFO_CONTAINER_PARTITION_MSDOS, "MSDOS partition information");
 
   return 0;
 }
 
+enum
+{
+  EXT23_OFFSET = 1024,
+  EXT23_SB_SIZE = 512,
+  EXT23_BLOCK_SIZE_BITS = 10,
+  EXT23_BLOCK_SIZE = (1 << EXT23_BLOCK_SIZE_BITS),
+  EXT23_SUPER_MAGIC = 0xEF53,
+  EXT23_FEATURE_COMPAT_HAS_JOURNAL = 0x4,
+};
 
+struct ext23_superblock
+{
+  uint32_t _r1[6];                      /**< 0x00 - 0x14 */
+  uint32_t s_log_block_size;            /**< 0x18 */
+  uint32_t _r2[7];                      /**< 0x1c - 0x34 */
+  uint16_t s_magic;                     /**< 0x38 */
+  uint16_t s_state;                     /**< 0x3a */
+  uint32_t _r3[8];                      /**< 0x3c - 0x58 */
+  uint32_t s_feature_compat;            /**< 0x5c */
+  uint32_t s_feature_incompat;          /**< 0x60 */
+  uint32_t s_feature_ro_compat;         /**< 0x64 */
+  uint8_t s_uuid[16];                   /**< 0x68 - 0x77 */
+};
+
 /**
  * check_for_ext23 - check to see if EXT23 is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * An EINVAL returned from lseek means that the device was too
- * small -- at least on Linux.
- *
- * Returns: -1 on error (with errno set), 1 if not EXT23,
- *          0 if EXT23 found (with type set)
  */
 
-static int check_for_ext23(int fd, char *type, unsigned type_len)
+static check check_for_ext23;
+static int check_for_ext23(const void *mem, size_t len, struct device_info *info)
 {
-  unsigned char buf[512];
-  uint16 *p = (uint16 *)buf;
-  int error;
+  const struct ext23_superblock *p = (const struct ext23_superblock *)((const unsigned char *)mem + EXT23_OFFSET);
 
-  error = lseek(fd, 1024, SEEK_SET);
-  if (error < 0)
-    return (errno == EINVAL) ? 1 : error;
-  else if (error != 1024)
-  {
-    errno = EINVAL;
-    return -1;
-  }
+  if (len < EXT23_OFFSET + EXT23_SB_SIZE)
+    return 1;
 
-  error = read(fd, buf, 512);
-  if (error < 0)
-    return error;
-  else if (error < 58)
+  if (osi_le16_to_cpu(p->s_magic) != EXT23_SUPER_MAGIC)
     return 1;
 
-  if (osi_le16_to_cpu(p[28]) != 0xEF53)
+  info->block_size = (EXT23_BLOCK_SIZE << osi_le32_to_cpu(p->s_log_block_size));
+
+  if (osi_le16_to_cpu(p->s_feature_compat) & EXT23_FEATURE_COMPAT_HAS_JOURNAL)
+    info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT3, "EXT3 filesystem");
+  else
+    info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_EXT23, DEVICE_INFO_FILESYSTEM_EXT2, "EXT2 filesystem");
+
+  return 0;
+}
+
+
+enum
+{
+  XFS_SB_SIZE = 512,
+  XFS_SB_MAGIC = 0x58465342,
+};
+
+struct xfs_superblock
+{
+  uint32_t sb_magicnum;
+  uint32_t sb_blocksize;
+  uint64_t sb_dblocks;
+  uint64_t sb_rblocks;
+  uint64_t sb_rextents;
+  uint8_t sb_uuid[16];
+};
+
+/**
+ * check_for_xfs - check to see if XFS is on this device
+ */
+
+static check check_for_xfs;
+static int check_for_xfs(const void *mem, size_t len, struct device_info *info)
+{
+  const struct xfs_superblock *p = mem;
+
+  if (len < XFS_SB_SIZE)
     return 1;
 
-  snprintf(type, type_len, "EXT2/3 filesystem");
+  if (osi_be32_to_cpu(p->sb_magicnum) != XFS_SB_MAGIC)
+    return 1;
 
+  info->block_size = osi_be32_to_cpu(p->sb_blocksize);
+
+  info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_XFS, 0, "XFS filesystem");
+
   return 0;
 }
 
 
 /**
  * check_for_swap - check to see if SWAP is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not SWAP,
- *          0 if SWAP found (with type set)
  */
 
-static int check_for_swap(int fd, char *type, unsigned type_len)
+static check check_for_swap;
+static int check_for_swap(const void *mem, size_t len, struct device_info *info)
 {
-  unsigned char buf[8192];
-  int error;
+  const unsigned char *buf = mem;
 
-  error = lseek(fd, 0, SEEK_SET);
-  if (error < 0)
-    return error;
-  else if (error != 0)
-  {
-    errno = EINVAL;
-    return -1;
-  }
-
-  error = read(fd, buf, 8192);
-  if (error < 0)
-    return error;
-  else if (error < 4096)
+  if (len < 8192)
     return 1;
 
   if (memcmp(buf + 4086, "SWAP-SPACE", 10) && memcmp(buf + 4086, "SWAPSPACE2", 10))
     return 1;
 
-  snprintf(type, type_len, "swap device");
+  info_set(info, DEVICE_INFO_SWAP, 0, 0, "swap device");
 
   return 0;
 }
 
 
+enum
+{
+  LVM1_SB_SIZE = 512,
+};
+
 /**
  * check_for_lvm1 - check to see if LVM1 is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not LVM1,
- *          0 if LVM1 found (with type set)
  */
 
-static int check_for_lvm1(int fd, char *type, unsigned type_len)
+static check check_for_lvm1;
+static int check_for_lvm1(const void *mem, size_t len, struct device_info *info)
 {
-  unsigned char buf[512];
-  int error;
+  const unsigned char *buf = mem;
 
-  error = lseek(fd, 0, SEEK_SET);
-  if (error < 0)
-    return error;
-  else if (error != 0)
-  {
-    errno = EINVAL;
-    return -1;
-  }
-
-  error = read(fd, buf, 512);
-  if (error < 0)
-    return error;
-  else if (error < 2)
+  if (len < LVM1_SB_SIZE)
     return 1;
 
   if (buf[0] != 'H' || buf[1] != 'M')
     return 1;
 
-  snprintf(type, type_len, "lvm1 subdevice");
+  info_set_container(info, DEVICE_INFO_CONTAINER_LVM1, 0, "LVM1 subdevice");
 
   return 0;
 }
@@ -274,39 +267,22 @@
 
 /**
  * check_for_lvm2 - check to see if LVM2 is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not LVM2,
- *          0 if LVM1 found (with type set)
  */
 
-static int check_for_lvm2(int fd, char *type, unsigned type_len)
+static check check_for_lvm2;
+static int check_for_lvm2(const void *mem, size_t len, struct device_info *info)
 {
-  unsigned char buf[512];
-  int error;
   int i;
 
+  if (len < 6 * 512)
+    return 1;
+
   /* LVM 2 labels can start in sectors 1-4 */
 
   for (i = 1; i < 5; i++)
   {
-    error = lseek(fd, 512 * i, SEEK_SET);
-    if (error < 0)
-      return (errno == EINVAL) ? 1 : error;
-    else if (error != 512 * i)
-    {
-      errno = EINVAL;
-      return -1;
-    }
+    const unsigned char *buf = (const unsigned char *)mem + 512 * i;
 
-    error = read(fd, buf, 512);
-    if (error < 0)
-      return error;
-    else if (error < 32)
-      return 1;
-
     if (strncmp(buf, "LABELONE", 8) != 0)
       continue;
     if (((uint64_t *)buf)[1] != i)
@@ -315,7 +291,7 @@
     if (strncmp(&buf[24], "LVM2 001", 8) != 0)
       continue;
 
-    snprintf(type, type_len, "lvm2 subdevice");
+    info_set_container(info, DEVICE_INFO_CONTAINER_LVM2, 0, "LVM2 subdevice");
 
     return 0;
   }
@@ -324,127 +300,85 @@
 }
 
 
+enum
+{
+  CIDEV_SB_SIZE = 512,
+};
+
 /**
  * check_for_cidev - check to see if CIDEV is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not CIDEV,
- *          0 if CIDEV found (with type set)
  */
 
-static int check_for_cidev(int fd, char *type, unsigned type_len)
+static check check_for_cidev;
+static int check_for_cidev(const void *mem, size_t len, struct device_info *info)
 {
-  unsigned char buf[512];
-  uint32 *p = (uint32 *)buf;
-  int error;
+  const uint32_t *p = mem;
 
-  error = lseek(fd, 0, SEEK_SET);
-  if (error < 0)
-    return error;
-  else if (error != 0)
-  {
-    errno = EINVAL;
-    return -1;
-  }
-
-  error = read(fd, buf, 512);
-  if (error < 0)
-    return error;
-  else if (error < 4)
+  if (len < CIDEV_SB_SIZE)
     return 1;
 
   if (osi_be32_to_cpu(*p) != 0x47465341)
     return 1;
 
-  snprintf(type, type_len, "CIDEV");
+  info_set_container(info, DEVICE_INFO_CONTAINER_CIDEV, 0, "CIDEV");
 
   return 0;
 }
 
 
+enum
+{
+  CCA_SB_SIZE = 512,
+};
+
 /**
  * check_for_cca - check to see if CCA is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not CCA,
- *          0 if CCA found (with type set)
  */
 
-static int check_for_cca(int fd, char *type, unsigned type_len)
+static check check_for_cca;
+static int check_for_cca(const void *mem, size_t len, struct device_info *info)
 {
-  unsigned char buf[512];
-  uint32 *p = (uint32 *)buf;
-  int error;
+  const uint32_t *p = mem;
 
-  error = lseek(fd, 0, SEEK_SET);
-  if (error < 0)
-    return error;
-  else if (error != 0)
-  {
-    errno = EINVAL;
-    return -1;
-  }
-
-  error = read(fd, buf, 512);
-  if (error < 0)
-    return error;
-  else if (error < 4)
+  if (len < CCA_SB_SIZE)
     return 1;
 
   if (osi_be32_to_cpu(*p) != 0x122473)
     return 1;
 
-  snprintf(type, type_len, "CCA device");
+  info_set_container(info, DEVICE_INFO_CONTAINER_CCA, 0, "CCA device");
 
   return 0;
 }
 
 
+enum
+{
+  REISERFS_SB_SIZE = 65 * 1024,
+};
+
 /**
  * check_for_reiserfs - check to see if reisterfs is on this device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * Returns: -1 on error (with errno set), 1 if not reiserfs,
- *          0 if CCA found (with type set)
  */
 
-static int check_for_reiserfs(int fd, char *type, unsigned type_len)
+static check check_for_reiserfs;
+static int check_for_reiserfs(const void *mem, size_t len, struct device_info *info)
 {
-  unsigned int pass;
-  uint64 offset;
-  unsigned char buf[512];
-  int error;
+  int pass;
 
+  if (len < REISERFS_SB_SIZE)
+    return 1;
+
   for (pass = 0; pass < 2; pass++)
   {
-    offset = (pass) ? 65536 : 8192;
+    unsigned int offset = (pass) ? 65536 : 8192;
+    const unsigned char *p = (const unsigned char *)mem + offset;
 
-    error = lseek(fd, offset, SEEK_SET);
-    if (error < 0)
-      return (errno == EINVAL) ? 1 : error;
-    else if (error != offset)
+    if (strncmp(p + 52, "ReIsErFs", 8) == 0 ||
+	strncmp(p + 52, "ReIsEr2Fs", 9) == 0 ||
+	strncmp(p + 52, "ReIsEr3Fs", 9) == 0)
     {
-      errno = EINVAL;
-      return -1;
-    }
-
-    error = read(fd, buf, 512);
-    if (error < 0)
-      return error;
-    else if (error < 62)
-      return 1;
-
-    if (strncmp(buf + 52, "ReIsErFs", 8) == 0 ||
-	strncmp(buf + 52, "ReIsEr2Fs", 9) == 0 ||
-	strncmp(buf + 52, "ReIsEr3Fs", 9) == 0)
-    {
-      snprintf(type, type_len, "Reiserfs filesystem");
+      info_set_filesystem(info, DEVICE_INFO_FILESYSTEM_REISERFS, 0, "ReiserFS filesystem");
       return 0;
     }
   }
@@ -453,69 +387,49 @@
 }
 
 
-/**
- * identify_device - figure out what's on a device
- * @fd: a file descriptor open on a device open for (at least) reading
- * @type: a buffer that contains the type of filesystem
- * @type_len: the amount of space pointed to by @type
- *
- * The offset of @fd will be changed by this function.
- * This routine will not write to the device.
- *
- * Returns: -1 on error (with errno set), 1 if unabled to identify,
- *          0 if device identified (with type set)
- */
+static check *checks[] =
+{
+  check_for_partition_msdos,
+  check_for_pool,
+  check_for_lvm1,
+  check_for_lvm2,
+  check_for_cidev,
+  check_for_cca,
+  check_for_ext23,
+  check_for_gfs,
+  check_for_reiserfs,
+  check_for_xfs,
+  check_for_swap,
+};
 
-int identify_device(int fd, char *type, unsigned type_len)
+int identify_device(int fd, struct device_info *info)
 {
-  int error;
+  int i;
+  const void *mem;
+  size_t len;
 
-  if (!type || !type_len)
+  if (!info)
   {
     errno = EINVAL;
     return -1;
   }
 
-  error = check_for_pool(fd, type, type_len);
-  if (error <= 0)
-    return error;
+  memset(info, sizeof (struct device_info), 0);
 
-  error = check_for_lvm1(fd, type, type_len);
-  if (error <= 0)
-    return error;
+  if (device_size(fd, &info->device_size) < 0)
+    return -1;
 
-  error = check_for_lvm2(fd, type, type_len);
-  if(error <= 0)
-    return error;
+  len = info->device_size <= 256*1024 ? info->device_size : 256*1024;
 
-  error = check_for_cidev(fd, type, type_len);
-  if (error <= 0)
-    return error;
+  mem = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
 
-  error = check_for_cca(fd, type, type_len);
-  if (error <= 0)
-    return error;
+  for (i = 0; i < sizeof (checks) / sizeof (*checks); ++i)
+  {
+    int error = checks[i](mem, len, info);
+    if (error <= 0)
+      return error;
+  }
 
-  error = check_for_gfs(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_ext23(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_reiserfs(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_swap(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
-  error = check_for_partition(fd, type, type_len);
-  if (error <= 0)
-    return error;
-
   return 1;
 }
 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050401/be6ba709/attachment.sig>

From cfeist at redhat.com  Fri Apr  1 18:05:14 2005
From: cfeist at redhat.com (Chris Feist)
Date: Fri, 01 Apr 2005 12:05:14 -0600
Subject: [Linux-cluster] Re: Kernel RPMS
In-Reply-To: <20050331103719.ogbpvxyzcc8g0s4c@imp.rexursive.com>
References: <64370.213.164.3.90.1111585743.squirrel@www.nodata.co.uk>	<42487E5B.3000000@redhat.com>
	<20050331103719.ogbpvxyzcc8g0s4c@imp.rexursive.com>
Message-ID: <424D8D5A.3070103@redhat.com>

Bojan Smojver wrote:

> So, the source kernels available for the current version of RHEL4 (i.e.
> kernel-2.6.9-5.0.3.EL) should be used as a base version for patching 
> with the
> above cluster stuff? I'm guessing the patches will then bring that 
> kernel (i.e.
> the current shipping one) in line with the kernel all those (no longer
> available) RPMS depended on? Or do we have to use the latest vanilla 
> kernels
> from kernel.org? Or it doesn't really matter because all 2.6.x kernels 
> are OK?


I believe that the HEAD is targetted to the latest vanilla kernels, that's 
what you'll want to build against.

Thanks,
Chris



From bojan at rexursive.com  Fri Apr  1 21:46:03 2005
From: bojan at rexursive.com (Bojan Smojver)
Date: Sat, 02 Apr 2005 07:46:03 +1000
Subject: [Linux-cluster] Re: Kernel RPMS
In-Reply-To: <424D8D5A.3070103@redhat.com>
References: <64370.213.164.3.90.1111585743.squirrel@www.nodata.co.uk>
	<42487E5B.3000000@redhat.com>
	<20050331103719.ogbpvxyzcc8g0s4c@imp.rexursive.com>
	<424D8D5A.3070103@redhat.com>
Message-ID: <1112391963.4676.0.camel@beast.rexursive.com>

On Fri, 2005-04-01 at 12:05 -0600, Chris Feist wrote:

> I believe that the HEAD is targetted to the latest vanilla kernels, that's 
> what you'll want to build against.

OK, get it. That's why I probably get LOCK_USE_CLNT issues with all the
tarballs...

-- 
Bojan



From lhh at redhat.com  Sat Apr  2 00:04:19 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 01 Apr 2005 19:04:19 -0500
Subject: [Linux-cluster] Need advice on cluster configuration
In-Reply-To: <424C5C72.4010403@digitaldan.com>
References: <424C5C72.4010403@digitaldan.com>
Message-ID: <1112400259.24872.115.camel@ayanami.boston.redhat.com>

On Thu, 2005-03-31 at 13:24 -0700, Daniel Cunningham wrote:

>    1.whats the relationship between the raw devices used in the cluster 
> software (which can share raw networked storage w/out GFS?) and when 
> your using GFS on  top of it (or are the two unrelated)?

The two are more or less unrelated.  Raw devices in clumanager are used
for storing internal clumanager states, and have a minimum size of 10mb
each. 

Components of GFS (CCA, pool volumes, file systems, etc.) can not be
used atop of those two raw devices, but may be used atop of the same
GNBD volume (with partitioning, of course!).

>    2.  rgmanager, how is this different from the cluster software's 
> fallback (failback?) domains and members taking over a service and a 
> related floating ip from a fallen member? 
> again thanks for your time

rgmanager ~= clumanager+1

Rgmanager is similar to clumanager, but is a bit more modular: it
supports on-the-fly reconfigurations of services and uses CMAN+DLM or
gulm for the infrastructure instead of providing its own.

-- Lon



From pshearer at lumbermens.net  Wed Apr  6 00:35:01 2005
From: pshearer at lumbermens.net (Peter Shearer)
Date: Tue, 5 Apr 2005 17:35:01 -0700
Subject: [Linux-cluster] LOCK_DLM Performance under Fire
Message-ID: <75FE40F00B17B344A490CAAEB6F2217F01A26E@lbcmail1.lumbermens.net>

Hi, Everyone --
 
I've been playing around with RHEL 4 and GFS from the tar files (not
CVS) on three OptiPlex GX280 workstations using hyperthreading, SATA
drives, and GNBD for sharing over a 1Gb network (dual NICs per machine).
I'm exploring moving a legacy file-based COBOL application/database over
to Linux on a bunch of smaller boxes vs its current home of a quad proc
AIX machine.  I have a test application which basically does applies a
bunch of file and record locks on and within files along with some
processor intense sorting algorithms to stress test the power of the
solution.  I'm running into some serious performance discrepancies of
which I hope someone can help me make sense.  Here's what I'm running
into when I test this app on different file systems:
 
ext3 on local disk, the test app takes about 3 min 20 sec to complete.
ext3 on GNBD exported disk (one node only, obviously); completes in
about 3 min 35 sec.
GFS on GNBD mounted with the localflocks option; completes in 5 min 30
sec.
GFS on GNBD mounted using LOCK_DLM with only one server mounting the fs;
completes in 50 min 45 sec.
GFS on GNBD mounted using LOCK_DLM with two servers mounting the fs;
went over 80 min and wasn't even half done.
GFS on GNBD mounted using LOCK_GULM...don't want to go there; I left it
running for over 2 hrs and it was worse off than the two servers using
LOCK_DLM.  :)
 
The test app mostly does a whole lot of file & record level locking --
not a lot of file transfer from the source disk to the memory of the
local server.  iostat on the client and server both show that the
transfer rate of data on and off the hard disk is only at about 300kBs.
top shows that the cpu on the client is being beat up as the dlm_astd,
lock_dlm1, and lock_dlm2 are taking on average 50% - 60% of the proc
(30%, 15%, 15%) and my test app is taking up the rest.  When it's
running on ext3 or GFS mounted with localflocks, there isn't this
problem at all -- the test app goes to 99% of cpu; hence the faster
completion times.  I have isolated the data paths so that the GNBD data
is running over one NIC and the rest of the cluster data is on the
second NIC in these computers.
 
Anyone have some ideas on how to tune this?  Would exporting the GNBD
file system with caching enabled help as I'm not using multiple GNBD
servers, just multiple GNBD clients?  Other options?  Am I just way off
base here?
 
Thanks!
________________________________________
Peter Shearer
A+, MCSE, MCSE: Security, CCNA
IT Network Engineer
Lumbermens
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050405/90c5adf5/attachment.htm>

From teigland at redhat.com  Wed Apr  6 02:53:21 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 6 Apr 2005 10:53:21 +0800
Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device
	identifier
In-Reply-To: <20050401160703.GB22564@wavehammer.waldi.eu.org>
References: <20050401142031.GA21976@wavehammer.waldi.eu.org>
	<20050401160703.GB22564@wavehammer.waldi.eu.org>
Message-ID: <20050406025321.GB6415@redhat.com>

On Fri, Apr 01, 2005 at 06:07:03PM +0200, Bastian Blank wrote:
> Updated patch.
> 
> Changes:
> - Use mmap.
> - Add the device size to the exported information.

Is libmagic standard enough to use instead of iddev?  If not, then
what about http://cvs.freedesktop.org/hal/hal/volume_id/ ?
Then we could just get rid of iddev.

-- 
Dave Teigland  <teigland at redhat.com>



From teigland at redhat.com  Wed Apr  6 03:47:39 2005
From: teigland at redhat.com (David Teigland)
Date: Wed, 6 Apr 2005 11:47:39 +0800
Subject: [Linux-cluster] LOCK_DLM Performance under Fire
In-Reply-To: <75FE40F00B17B344A490CAAEB6F2217F01A26E@lbcmail1.lumbermens.net>
References: <75FE40F00B17B344A490CAAEB6F2217F01A26E@lbcmail1.lumbermens.net>
Message-ID: <20050406034739.GC6415@redhat.com>

On Tue, Apr 05, 2005 at 05:35:01PM -0700, Peter Shearer wrote:

> ext3 on local disk, the test app takes about 3 min 20 sec to complete.
> ext3 on GNBD exported disk (one node only, obviously); completes in
> about 3 min 35 sec.
> GFS on GNBD mounted with the localflocks option; completes in 5 min 30
> sec.
> GFS on GNBD mounted using LOCK_DLM with only one server mounting the fs;
> completes in 50 min 45 sec.
> GFS on GNBD mounted using LOCK_DLM with two servers mounting the fs;
> went over 80 min and wasn't even half done.

It sounds like the app is using fcntl (posix) locks, not flock(2)?
If so, that's a weak spot for lock_dlm which translates posix-lock
requests into multiple dlm lock operations.

That said, it's possible the code may be doing some dumb things that
could be fixed to improve the speed.  If there are hundreds of files
being locked, one simple thing to try is to increase SHRINK_CACHE_COUNT
and SHRINK_CACHE_MAX in lock_dlm.h (sorry, never made them tunable
through proc.)  This relates to some basic caching lock_dlm does for
files that are repeatedly locked/unlocked.

If the app could get by with just using flock() that would certainly be
much faster.  Also, if you could provide the test you use or a simplified
equivalent it would help.

-- 
Dave Teigland  <teigland at redhat.com>



From bastian at waldi.eu.org  Wed Apr  6 07:53:13 2005
From: bastian at waldi.eu.org (Bastian Blank)
Date: Wed, 6 Apr 2005 09:53:13 +0200
Subject: [Linux-cluster] [PATCH] iddev - convert to general purpose device
	identifier
In-Reply-To: <20050406025321.GB6415@redhat.com>
References: <20050401142031.GA21976@wavehammer.waldi.eu.org>
	<20050401160703.GB22564@wavehammer.waldi.eu.org>
	<20050406025321.GB6415@redhat.com>
Message-ID: <20050406075313.GA6054@wavehammer.waldi.eu.org>

On Wed, Apr 06, 2005 at 10:53:21AM +0800, David Teigland wrote:
> Is libmagic standard enough to use instead of iddev?

It returns a string, maybe a little bit too loose.

>                                                       If not, then
> what about http://cvs.freedesktop.org/hal/hal/volume_id/ ?

Returns enough information for the gfs part, name of the filesystem and
uuid, not enough for what I want to use it (block size, filesystem
size).

Bastian

-- 
Worlds are conquered, galaxies destroyed -- but a woman is always a woman.
		-- Kirk, "The Conscience of the King", stardate 2818.9
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050406/8e80a492/attachment.sig>

From pshearer at lumbermens.net  Wed Apr  6 19:01:02 2005
From: pshearer at lumbermens.net (Peter Shearer)
Date: Wed, 6 Apr 2005 12:01:02 -0700
Subject: [Linux-cluster] LOCK_DLM Performance under Fire
Message-ID: <75FE40F00B17B344A490CAAEB6F2217F01A26F@lbcmail1.lumbermens.net>

Ick...it appears the apps's locking mechanism is fnctl.  An strace off
the app is full of...

fcntl64(8, F_SETLK64, {type=F_UNLCK, whence=SEEK_SET, start=2147478526,
len=1024}, 0xbffff5a0) = 0
fcntl64(8, F_SETLK64, {type=F_WRLCK, whence=SEEK_SET, start=2147477478,
len=1}, 0xbffff4f0) = 0

...type messages.

The app itself is a really old COBOL app built on Liant's RM/Cobol -- an
abstraction software similar to java which allows the same object code
to run on Linux, UNIX, and Windows with very little modification through
a runtime application.  So, while I have access to the source for the
compiled object, I don't have access to the runtime app code, which is
really the thing doing all the locking.

This specific testing app is opening one file with locks, but it's
beating that file up.  Essentially, it's going through the file and
performing a series of sorts and searches, which, for the most part,
would beat up the proc more than the I/O.  The "real" application for
the most part will not be nearly as intense, but will open probably
around 100 shared files simultaneously with posix locking.  Would
adjusting the SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h
affect this type of application?  Any other tunable parameters which
will help out?  I'm not tied to DLM at this point...is there another
mechanism which would do this equally well?

As for a test app...I'm not sure I'll be able to provide that.  I'll
look into it, though.

--Peter


-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Tuesday, April 05, 2005 8:48 PM
To: Peter Shearer
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] LOCK_DLM Performance under Fire


On Tue, Apr 05, 2005 at 05:35:01PM -0700, Peter Shearer wrote:

> ext3 on local disk, the test app takes about 3 min 20 sec to complete.
> ext3 on GNBD exported disk (one node only, obviously); completes in
> about 3 min 35 sec.
> GFS on GNBD mounted with the localflocks option; completes in 5 min 30
> sec.
> GFS on GNBD mounted using LOCK_DLM with only one server mounting the
fs;
> completes in 50 min 45 sec.
> GFS on GNBD mounted using LOCK_DLM with two servers mounting the fs;
> went over 80 min and wasn't even half done.

It sounds like the app is using fcntl (posix) locks, not flock(2)?
If so, that's a weak spot for lock_dlm which translates posix-lock
requests into multiple dlm lock operations.

That said, it's possible the code may be doing some dumb things that
could be fixed to improve the speed.  If there are hundreds of files
being locked, one simple thing to try is to increase SHRINK_CACHE_COUNT
and SHRINK_CACHE_MAX in lock_dlm.h (sorry, never made them tunable
through proc.)  This relates to some basic caching lock_dlm does for
files that are repeatedly locked/unlocked.

If the app could get by with just using flock() that would certainly be
much faster.  Also, if you could provide the test you use or a
simplified
equivalent it would help.

-- 
Dave Teigland  <teigland at redhat.com>



From teigland at redhat.com  Thu Apr  7 02:30:37 2005
From: teigland at redhat.com (David Teigland)
Date: Thu, 7 Apr 2005 10:30:37 +0800
Subject: [Linux-cluster] LOCK_DLM Performance under Fire
In-Reply-To: <75FE40F00B17B344A490CAAEB6F2217F01A26F@lbcmail1.lumbermens.net>
References: <75FE40F00B17B344A490CAAEB6F2217F01A26F@lbcmail1.lumbermens.net>
Message-ID: <20050407023037.GA6615@redhat.com>

On Wed, Apr 06, 2005 at 12:01:02PM -0700, Peter Shearer wrote:

> The app itself is a really old COBOL app built on Liant's RM/Cobol -- an
> abstraction software similar to java which allows the same object code
> to run on Linux, UNIX, and Windows with very little modification through
> a runtime application.  So, while I have access to the source for the
> compiled object, I don't have access to the runtime app code, which is
> really the thing doing all the locking.
> 
> This specific testing app is opening one file with locks, but it's
> beating that file up.  Essentially, it's going through the file and
> performing a series of sorts and searches, which, for the most part,
> would beat up the proc more than the I/O.  The "real" application for
> the most part will not be nearly as intense, but will open probably
> around 100 shared files simultaneously with posix locking.  Would
> adjusting the SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h
> affect this type of application?  Any other tunable parameters which
> will help out?  I'm not tied to DLM at this point...is there another
> mechanism which would do this equally well?

Taking a step back, is this a parallelized/clusterized application?
i.e. will it be running concurrently on different machines with the
data shared using GFS?  If so, then the distributed fcntl locks are
critical.  If not, it would be safe to use the localflocks mount option
which means fcntl locks are no longer translated to distributed locks.

-- 
Dave Teigland  <teigland at redhat.com>



From serge at triumvirat.ru  Thu Apr  7 12:02:32 2005
From: serge at triumvirat.ru (Sergey)
Date: Thu, 7 Apr 2005 16:02:32 +0400
Subject: [Linux-cluster] problems with gfs locking
Message-ID: <341716745.20050407160232@triumvirat.ru>

Hello everybody!

Please, someone help me with a huge problem.

We have two servers HP DL380G4 connected to HP MSA500 (Modular Smart
Array with currently installed 3 disks as RAID-5 summary volume of
274G). Servers works under Red Hat Enterprise Linux, data storage is
formatted to GFS. Two months system with 2 nodes works fine. But two
weeks ago we started experiencing problems with system load. Symptoms
are as follows:

1. Server on which httpd is running become unstable because of
increasing of simultaneously running processes - uptime shows numbers
10, 20,..., 120, 160 in few minutes, top hangs after this number is
big enough. If run ps to see httpd processes, all of them will be with
status D (uninterruptible sleep) - so Apache runs MaxClients processes
every of them never ends. I can't kill none of them and they are
locked with high probability by GFS - there are two processes
gulm_Cb_Handler both taking about 100% of CPU usage.

2. Apache server-status shows that almost every process hangs with
status W (sending reply), MySQL shows that lot of connections are open
(each script in auto-prepend file opens connection) but they are
sleeping. Apache document_root points to GFS raid, so every
http-request causes filesystem to read or write files (users activity
was about 8 Gb in 10000 files in last month, which is twice as much in
previous month, when system seemed stable). Now filesystem is used at
15% (about 40Gb of 274Gb), the biggest folder contains over 30000
files - may be this is the reason of problems, like when quantity
turns into (low) quality.

3. Another reason which caused locking of filesystem is cvs, which
goes over all of that thousands of files. But this can not be repeated
- only few times cvs hanged while updating (in fact, checking) some
folders (not very big sometimes).

4. Traffic diagram (by MRTG) shows that when GFS going down there are suspicious
spikes of activity on network interface which is used to link GFS
nodes raising up to 4 Mbits/sec (while average throughput is about 100
kbits/sec) in both sides. We assume that our problems started when we
changed link between two nodes from plain patch cord to Cisco Catalyst
switch (which may have only 10 Mbits/sec througput). Can slow network be the
reason of our troubles? And another question - does journals
synchronizes or is there any other activity between two nodes while
reading data from GFS on one of them?

Thanks for any qualified answers.



--
Sergey



From pshearer at lumbermens.net  Thu Apr  7 16:40:34 2005
From: pshearer at lumbermens.net (Peter Shearer)
Date: Thu, 7 Apr 2005 09:40:34 -0700
Subject: [Linux-cluster] LOCK_DLM Performance under Fire
Message-ID: <75FE40F00B17B344A490CAAEB6F2217F2159AE@lbcmail1.lumbermens.net>

Yes, the idea was to parallelize the app across multiple machines
sharing a common SAN infrastructure (hopefully iSCSI; if not, then GNBD
in the interim).  There is no central control daemon or database
manager; each instance of the app does its own record locking and such,
so it really doesn't matter where the data resides, as long as all the
clients are able to touch the same files.  Therefore, distributed locks
are really important.

I had suspected that the locking subsys was causing the slowdowns, so
that's why I did a test with the localflocks -- it's not as fast as
ext3, but works fine with only one server involved.  Of course, that's
not going to work for this application.  :)

--Peter

-----Original Message-----
From: David Teigland [mailto:teigland at redhat.com] 
Sent: Wednesday, April 06, 2005 7:31 PM
To: Peter Shearer
Cc: linux-cluster at redhat.com
Subject: Re: [Linux-cluster] LOCK_DLM Performance under Fire


On Wed, Apr 06, 2005 at 12:01:02PM -0700, Peter Shearer wrote:

> The app itself is a really old COBOL app built on Liant's RM/Cobol --
an
> abstraction software similar to java which allows the same object code
> to run on Linux, UNIX, and Windows with very little modification
through
> a runtime application.  So, while I have access to the source for the
> compiled object, I don't have access to the runtime app code, which is
> really the thing doing all the locking.
> 
> This specific testing app is opening one file with locks, but it's
> beating that file up.  Essentially, it's going through the file and
> performing a series of sorts and searches, which, for the most part,
> would beat up the proc more than the I/O.  The "real" application for
> the most part will not be nearly as intense, but will open probably
> around 100 shared files simultaneously with posix locking.  Would
> adjusting the SHRINK_CACHE_COUNT and SHRINK_CACHE_MAX in lock_dlm.h
> affect this type of application?  Any other tunable parameters which
> will help out?  I'm not tied to DLM at this point...is there another
> mechanism which would do this equally well?

Taking a step back, is this a parallelized/clusterized application?
i.e. will it be running concurrently on different machines with the
data shared using GFS?  If so, then the distributed fcntl locks are
critical.  If not, it would be safe to use the localflocks mount option
which means fcntl locks are no longer translated to distributed locks.

-- 
Dave Teigland  <teigland at redhat.com>



From daniel at osdl.org  Tue Apr 12 00:13:06 2005
From: daniel at osdl.org (Daniel McNeil)
Date: Mon, 11 Apr 2005 17:13:06 -0700
Subject: [Linux-cluster] test hung after 36 hours
Message-ID: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net>

I started my mount/tar/rm/ tests on Apr  4 17:41 and I hit
a problem at Apr  6 05:30.  So the test ran for 36 hours.
cl030 and cl031 were getting "SM: process_reply invalid"
messages and cl032 got "No response" and "Missed too many
heartbeats"


cl032:
[-- MARK -- Wed Apr  6 05:15:00 2005]
CMAN: removing node cl030a from the cluster : Missed too many heartbeats
CMAN: removing node cl031a from the cluster : No response to messages
CMAN: quorum lost, blocking activity
[-- MARK -- Wed Apr  6 05:30:00 2005]
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"

cl030:
[-- MARK -- Wed Apr  6 05:15:00 2005]
CMAN: removing node cl032a from the cluster : Missed too many heartbeats
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
GFS: fsid=gfs_cluster:stripefs.0: Joined cluster. Now mounting FS...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=0: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=1: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=2: Done
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Trying to acquire journal lock...
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Looking at journal...
GFS: fsid=gfs_cluster:stripefs.0: jid=3: Done
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295

cl031:
[-- MARK -- Wed Apr  6 05:15:00 2005]
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20496 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20497 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20500 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20501 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
SM: process_reply invalid id=20504 nodeid=4294967295
GFS: Trying to join cluster "lock_dlm", "gfs_cluster:stripefs"
SM: process_reply invalid id=20505 nodeid=4294967295
GFS: fsid=gfs_cluster:stripefs.1: Joined cluster. Now mounting FS...

A bit more info is available here.
http://developer.osdl.org/daniel/GFS/test.04apr2005/

Any ideas on what is going on?

Daniel





From teigland at redhat.com  Tue Apr 12 03:30:26 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 12 Apr 2005 11:30:26 +0800
Subject: [Linux-cluster] test hung after 36 hours
In-Reply-To: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net>
References: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net>
Message-ID: <20050412033026.GB7350@redhat.com>

On Mon, Apr 11, 2005 at 05:13:06PM -0700, Daniel McNeil wrote:
> I started my mount/tar/rm/ tests on Apr  4 17:41 and I hit
> a problem at Apr  6 05:30.  So the test ran for 36 hours.
> cl030 and cl031 were getting "SM: process_reply invalid"
> messages and cl032 got "No response" and "Missed too many
> heartbeats"

The SM messages are an effect of CMAN removing nodes.  There's a fair
chance that this recent fix will help:
http://sources.redhat.com/ml/cluster-cvs/2005-q2/msg00018.html

-- 
Dave Teigland  <teigland at redhat.com>



From Hansjoerg.Maurer at dlr.de  Wed Apr 13 06:32:24 2005
From: Hansjoerg.Maurer at dlr.de (Hansjoerg Maurer)
Date: Wed, 13 Apr 2005 08:32:24 +0200
Subject: [Linux-cluster] Iozone tests on gfs with 29th march gfs snapshot
Message-ID: <425CBCF8.3090908@dlr.de>

Hi,

we are planning a small cluster (3 nodes in the SAN and 13 nodes with gnbd)
and I have tried gfs 6.1 from 29th march on RHEL testkernel 
2.6.9-6.36.ELsmp x64 (elevator=deadline) Opteron  x64 Hardware
(we want to use  RHEL4 based systems, because our cluster application 
performes much better with
RHEL4 gcc on the opteron Hardware compared to RHEL3 gcc)

Installation was  fine :-)

We have done some iozone runs on a local disk  (SAN hardware is not 
avaliable yet) with
- gfs and lock_dlm
- gfs mounted with localcache
- ext3
- gnbd (test on remote host)
- nfs (test on remote host)

The avaliable memory of the computers was reduced to 1G to speed up the 
test.
There are some interesting points:
- gfs seems to perform nearly as good as ext3 only with reclen 1024 and 
during write
- gfs read performance seems to be not very good (are there any flags to 
improve it?)
- there seems to be no big difference in mount-option with localcache 
and using lock_dlm
- gnbd's write performance seems to be better as nfs
- nfs read performance seems to be better as gnbd's
- running two gnbd tests in parallel reduces read performance 
dramatically (may be an hardware issue,
   because it seems to be the same with nfs)



We want to use he cluster filesystem mostly to read and proceed 2 GB 
datasets.
So we will test it with an application soon, which ist not a synthetic 
benchmark.

Is there any prefered elevator, one should use with gfs?
(I will try some tests this evening)

It would be nice, if anyone with gfs experience could comment on the 
results.
We will have the hardware avaliable for testing until the mid of next week,
though if someone wants me to try some other configurations (including 
current CVS)
give me a note.


Thank you very much

Hansj?rg


-- Ext3

        Run began: Tue Apr 12 20:55:54 2005

        Using minimum file size of 2097152 kilobytes.
        Using maximum file size of 2097152 kilobytes.
        Auto Mode
        Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 
1 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  
random    bkwd  record  stride                                  
              KB  reclen   write rewrite    read    reread    read   
write    read rewrite    read   fwrite frewrite   fread  freread
         2097152      64   56657   55876    38324    37653   11437   
20542                                                         
         2097152     128   59898   56835    38341    38075   19130   
33323                                                         
         2097152     256   61116   57787    35269    38075   15809   
43095                                                         
         2097152     512   61501   58274    37937    38266   23262   
39051                                                         
         2097152    1024   58998   53493    37869    35665   37163   
38920                                                         
         2097152    2048   58333   58310    35291    38273   47074   
38147                                                         
         2097152    4096   60143   60270    37873    38445   54552   
37152                                                         
         2097152    8192   57725   50131    37767    38112   54441   
37164                                                         
         2097152   16384   53178   57389    38317    38283   58545   
34425                                                         

iozone test complete.



-- GFS lock_dlm
 
        Run began: Tue Apr 12 19:16:51 2005

        Using minimum file size of 2097152 kilobytes.
        Using maximum file size of 2097152 kilobytes.
        Auto Mode
        Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 
1 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  
random    bkwd  record  stride                                  
              KB  reclen   write rewrite    read    reread    read   
write    read rewrite    read   fwrite frewrite   fread  freread
         2097152      64   37142   37320    24185    25787    7972   
22459                                                         
         2097152     128   34475   34459    25381    25408   13556   
21645                                                         
         2097152     256   33592   33405    25220    25222   11011   
21268                                                         
         2097152     512   32269   31527    25081    25308   14364   
23356                                                         
         2097152    1024   57865   35774    24190    24919   23027   
28430                                                         
         2097152    2048   41923   34880    25364    24216   29515   
31278                                                         
         2097152    4096   46328   35647    24187    25127   32752   
30084                                                         
         2097152    8192   32296   34830    25172    24000   35714   
31385                                                         
         2097152   16384   40136   38008    24386    25317   39407   
37255                                                         

iozone test complete.


-- GFS mounted with localcache

        Run began: Tue Apr 12 19:17:53 2005

        Using minimum file size of 2097152 kilobytes.
        Using maximum file size of 2097152 kilobytes.
        Auto Mode
        Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 
1 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  
random    bkwd  record  stride                                  
              KB  reclen   write rewrite    read    reread    read   
write    read rewrite    read   fwrite frewrite   fread  freread
         2097152      64   35995   35988    24151    26080    8119   
21277                                                         
         2097152     128   31582   35475    25624    25818   13604   
22245                                                         
         2097152     256   30484   34576    25344    25073   11117   
21686                                                         
         2097152     512   29355   35323    25542    25107   14942   
24594                                                         
         2097152    1024   33213   32260    24896    25344   22427   
26454                                                         
         2097152    2048   34852   36949    24417    25377   30143   
31454                                                         
         2097152    4096   42722   33431    24978    24774   32416   
31650                                                         
         2097152    8192   43942   32786    25752    24461   36606   
33237                                                         
         2097152   16384   32568   33575    25072    25506   38057   32971


- NFS mounted ext3

        Run began: Tue Apr 12 22:08:04 2005

        Include close in write timing
        Using minimum file size of 2097152 kilobytes.
        Using maximum file size of 2097152 kilobytes.
        Auto Mode
        Command line used: /opt/iozone/bin/iozone -c -n 2G -g 2G -a -i 0 
-i 1 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  
random    bkwd  record  stride                                  
              KB  reclen   write rewrite    read    reread    read   
write    read rewrite    read   fwrite frewrite   fread  freread
         2097152      64   29933   24934    34428    34329   11396    
5095                                                         
         2097152     128   29848   24387    33693    33815   18280    
5822                                                         
         2097152     256   30033   25856    33915    33520   24666    
7303                                                         
         2097152     512   29868   26192    33932    32201   16207    
7797                                                         
         2097152    1024   30857   25485    31165    32378   27212    
9730                                                         
         2097152    2048   28617   24478    33258    35049   41341   
10497                                                         
         2097152    4096   29514   25804    33221    31459   49188    
9050                                                         
         2097152    8192   30777   25721    32443    32264   46874    
8502                                                         
         2097152   16384   28470   25419    34607    34286   57369    
8056                                                         

iozone test complete.


- GNBD mounted GFS

        Run began: Tue Apr 12 19:16:51 2005

        Using minimum file size of 2097152 kilobytes.
        Using maximum file size of 2097152 kilobytes.
        Auto Mode
        Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 
1 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  
random    bkwd  record  stride                                  
              KB  reclen   write rewrite    read    reread    read   
write    read rewrite    read   fwrite frewrite   fread  freread
         2097152      64   37142   37320    24185    25787    7972   
22459                                                         
         2097152     128   34475   34459    25381    25408   13556   
21645                                                         
         2097152     256   33592   33405    25220    25222   11011   
21268                                                         
         2097152     512   32269   31527    25081    25308   14364   
23356                                                         
         2097152    1024   57865   35774    24190    24919   23027   
28430                                                         
         2097152    2048   41923   34880    25364    24216   29515   
31278                                                         
         2097152    4096   46328   35647    24187    25127   32752   
30084                                                         
         2097152    8192   32296   34830    25172    24000   35714   
31385                                                         
         2097152   16384   40136   38008    24386    25317   39407   
37255                                                         

iozone test complete.

--GNBD mounted  GFS (2 simultanous runs und 2 gnbd clients)

        Run began: Tue Apr 12 22:33:15 2005

        Using minimum file size of 2097152 kilobytes.
        Using maximum file size of 2097152 kilobytes.
        Auto Mode
        Command line used: /opt/iozone/bin/iozone -n 2G -g 2G -a -i 0 -i 
1 -i 2
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                            random  
random    bkwd  record  stride                                  
              KB  reclen   write rewrite    read    reread    read   
write    read rewrite    read   fwrite frewrite   fread  freread
         2097152      64   21310   20580     9275    10662    3472   
10022                                                         
         2097152     128   11953   17971     8171    10598    4633   
17555                                                         
         2097152     256   15856   22844     5131    11313    4157   
25773                                                         
         2097152     512   15887   28725     4182    12117    7046   
27678                                                         
         2097152    1024   28444   26170     3780    10233    9771   
30745                                                         
         2097152    2048   38449   27667     4042    10059   12872   
30426                                                         
         2097152    4096   27096   35431     4699    11047   15139   
31802                                                         
         2097152    8192   29305   37275     4696     9728   14856   
39259                                                         
         2097152   16384   26824   49085     8836     5279   16865   
57405                                                         

iozone test complete.


-- 
_________________________________________________________________

Dr.  Hansjoerg Maurer           | LAN- & System-Manager
                                |
Deutsches Zentrum               | DLR Oberpfaffenhofen
  f. Luft- und Raumfahrt e.V.   |
Institut f. Robotik             |
Postfach 1116                   | Muenchner Strasse 20
82230 Wessling                  | 82234 Wessling
Germany                         |
                                |
Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
__________________________________________________________________


There are 10 types of people in this world, 
those who understand binary and those who don't.



From birger at birger.sh  Wed Apr 13 12:18:52 2005
From: birger at birger.sh (birger)
Date: Wed, 13 Apr 2005 14:18:52 +0200
Subject: [Linux-cluster] Problems compiling cluster software on fedora core 3
Message-ID: <425D0E2C.20102@birger.sh>

I fetched the cluster sources using cvs today.
I have tried compiling them on Fedora Core 3

I used ./configure --kernel=/lib/modules/2.6.11-1.14_FC3/build
to compile without installing full kernel source.

First I had to edit cluster/cman/lib/libcman.c and change
#include <cluster/cnxman-socket.h>
into
#include <cnxman-socket.h>

Then I ran into problems with cmirror wanting dm-log.h and dm-io.h. I 
found these in the device-manager source, but compilation then fails 
with syntax errors.

Can someone give some advice on how to compile and install this?

-- 
birger



From jbrassow at redhat.com  Wed Apr 13 14:01:11 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Wed, 13 Apr 2005 09:01:11 -0500
Subject: [Linux-cluster] Problems compiling cluster software on fedora
	core 3
In-Reply-To: <425D0E2C.20102@birger.sh>
References: <425D0E2C.20102@birger.sh>
Message-ID: <b843add28929906067c4343fad67b096@redhat.com>


On Apr 13, 2005, at 7:18 AM, birger wrote:

> Then I ran into problems with cmirror wanting dm-log.h and dm-io.h. I 
> found these in the device-manager source, but compilation then fails 
> with syntax errors.

I shouldn't even be compiling this from the top level.  It's not ready 
and there needs to be accompanying device-mapper changes.

Please forgive.  I think cmirror is the last thing to compile, so if 
you ignore the error the rest should have installed fine.  If you don't 
like the errors, you can comment out cmirror in the makefile - which is 
what I'm going to do right now.

  brassow



From hansjoerg.maurer at dlr.de  Wed Apr 13 18:35:40 2005
From: hansjoerg.maurer at dlr.de (=?ISO-8859-1?Q?Hansj=F6rg_Maurer?=)
Date: Wed, 13 Apr 2005 20:35:40 +0200
Subject: [Linux-cluster] GNBD multipath with devicemapper?
Message-ID: <425D667C.2050701@dlr.de>

Hi

I am trying to set up gnbd with multipath.
Accoding to the gnbd_usage.txt file, I understand, that this should work 
with
dm-multipath.
But unfortunatly only the gfs part of the setup is descriped there.

Has anybody experiance with this setup, especially how to set up
multipath with multiple /dev/gnbd* and how to setup the multipath.conf file
 
Thank you very much

Hansj?rg Maurer




From daniel at osdl.org  Wed Apr 13 21:56:08 2005
From: daniel at osdl.org (Daniel McNeil)
Date: Wed, 13 Apr 2005 14:56:08 -0700
Subject: [Linux-cluster] oops after 12 hours during umount
In-Reply-To: <20050412033026.GB7350@redhat.com>
References: <1113264786.31312.16.camel@ibm-c.pdx.osdl.net>
	<20050412033026.GB7350@redhat.com>
Message-ID: <1113429368.31312.39.camel@ibm-c.pdx.osdl.net>

On Mon, 2005-04-11 at 20:30, David Teigland wrote:
> On Mon, Apr 11, 2005 at 05:13:06PM -0700, Daniel McNeil wrote:
> > I started my mount/tar/rm/ tests on Apr  4 17:41 and I hit
> > a problem at Apr  6 05:30.  So the test ran for 36 hours.
> > cl030 and cl031 were getting "SM: process_reply invalid"
> > messages and cl032 got "No response" and "Missed too many
> > heartbeats"
> 
> The SM messages are an effect of CMAN removing nodes.  There's a fair
> chance that this recent fix will help:
> http://sources.redhat.com/ml/cluster-cvs/2005-q2/msg00018.html

Good news and bad news. 

Good news: I think my previous problem was an
network upgrade that accidentally cut off one of my nodes.

Bad news: after upgrading to the latest cvs I hit an oops after
12 hours.  The below looks life we are accessing freed memory.
I have slab debug and spin lock debug configured.

Here's the oops:

Unable to handle kernel paging request at virtual address 6b6b6bbf
 printing eip:
c03e8682
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod video
CPU:    0
EIP:    0060:[<c03e8682>]    Not tainted VLI
EFLAGS: 00010246   (2.6.11)
EIP is at _spin_lock+0x22/0x90
eax: 00000000   ebx: 6b6b6bbf   ecx: 00000001   edx: cdc82000
esi: cdc82000   edi: 6b6b6bbf   ebp: cdc82ea4   esp: cdc82e9c
ds: 007b   es: 007b   ss: 0068
Process umount (pid: 14022, threadinfo=cdc82000 task=cc113a60)
Stack: d2bee958 d2beea7c cdc82ebc c0162f06 d2bee958 d2bee968 d2bee958 6b6b6b6b
       cdc82edc c017bb24 d2bee958 00004192 00000001 cdc82eec ce844050 f90314e0
       cdc82efc c017bc14 cbd665d0 cdc82eec d2bee4ec cbe47b3c cbd66544 ce844050
Call Trace:
 [<c01041ff>] show_stack+0x7f/0xa0
 [<c01043b2>] show_registers+0x162/0x1e0
 [<c01045de>] die+0xfe/0x190
 [<c0115892>] do_page_fault+0x3b2/0x6f2
 [<c0103e57>] error_code+0x2b/0x30
 [<c0162f06>] invalidate_inode_buffers+0x46/0x90
 [<c017bb24>] invalidate_list+0x44/0xe0
 [<c017bc14>] invalidate_inodes+0x54/0x90
 [<c0167974>] generic_shutdown_super+0x74/0x140
 [<f9010aee>] gfs_kill_sb+0x2e/0x69 [gfs]
 [<c0167821>] deactivate_super+0x81/0xa0
 [<c017ed5c>] sys_umount+0x3c/0xa0
 [<c017edd9>] sys_oldumount+0x19/0x20
 [<c010335d>] sysenter_past_esp+0x52/0x75
Code: 00 00 00 8d bf 00 00 00 00 55 89 e5 83 ec 08 89 1c 24 89 c3 b8 01 00 00 00 89 74 24 04 e8 47 06 d3 ff be 00 f0 ff ff 21 e6 31 c0 <86> 03 84 c0 7e 0b 8b 1c 24 8b 74 24 04 89 ec 5d c3 b8 01 00 00


Daniel



From birger at birger.sh  Thu Apr 14 07:53:36 2005
From: birger at birger.sh (birger)
Date: Thu, 14 Apr 2005 09:53:36 +0200
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
Message-ID: <425E2180.6060609@birger.sh>

I think I sent this with a wrong sender address, as I didn't see it 
apear on the list. I have solved the problem with starting ccsd in my 
original message (if it ever went out)

I have just installed the cluster package, and I am now looking for some
help on how to use it :-)

I have a lot of experience with Veritas FirstWatch and some with
SunCluster, so I am not new to HA services. Now I have this server that
I have to get up and running as quickly as possible... And I find little
documentation about how to get this software up and running from a fresh
install.

I have one old file server with external scsi disks, and one new server
with a SCSI-attached Nexsan ATAboy RAID array.

I want to set up the new file server as half of a 2-node cluster and get
it into production. Then move over data (and disks) from the old server
until I can reinstall that one as the second cluster node.

I first thought along the lines I am used to from Solaris and
clustering. I wanted to set up 2 services: 1 NFS service that would take
the disk with it when it moved, and 1 samba service that NFS-mounted the
disk from the nfs service.

After looking at the redhat stuff I am thinking:
- Mount the disks permanently on both nodes using gfs (less chance of
nuking the file systems because of a split-brain)
- Perhaps also run NFS services permanently on both nodes, failing over
only the IP address of the official NFS service. Should make failover
even faster, but are there pitfalls to running multiple NFS servers off
the same gfs file system? In addition to failing over the IP address, I
would have to look into how to take along NFS file locks when doing a
takeover.
- samba running as a service that fails over if a node goes down.

Can anyone 'talk me through' the steps needed to get this up and running?

First attempts at starting ccsd failed with
Failed to connect to cluster manager.
Hint: Magma plugins are not in the right spot.

I fixed this by cd'ing down into magma in the cluster directory i 
fetched with cvs and doing
make clean
make
make install

When I did a make from top-level of the cvs sources magma got built with 
plugin dir pointing into the source directory. Just recompiling (without 
rerunning configure) fixed it. Something to look into for the maintainers?

I now have my first gfs file system, but I get a permission denied when 
trying to mount it. How should I diagnose this?

-- 
birger



From birger at birger.sh  Thu Apr 14 07:54:36 2005
From: birger at birger.sh (birger)
Date: Thu, 14 Apr 2005 09:54:36 +0200
Subject: [Linux-cluster] Problems compiling cluster software on fedora
	core 3
In-Reply-To: <b843add28929906067c4343fad67b096@redhat.com>
References: <425D0E2C.20102@birger.sh>
	<b843add28929906067c4343fad67b096@redhat.com>
Message-ID: <425E21BC.9030000@birger.sh>


Resending this as I may have sent it using wrong sender address. I never 
saw it appear...

Jonathan E Brassow wrote:

>
> I shouldn't even be compiling this from the top level.  It's not ready 
> and there needs to be accompanying device-mapper changes.

That explains a lot :-D

>
> Please forgive.  I think cmirror is the last thing to compile, so if 
> you ignore the error the rest should have installed fine.  If you 
> don't like the errors, you can comment out cmirror in the makefile - 
> which is what I'm going to do right now.

I did, and did a new make (that didn't have anything to do) and a make
install.
Thanks for your answer. It certainly solved my problems.



From birger at uib.no  Thu Apr 14 05:31:15 2005
From: birger at uib.no (Birger Wathne)
Date: Thu, 14 Apr 2005 07:31:15 +0200
Subject: [Linux-cluster] Problems compiling cluster software on fedora
	core 3
In-Reply-To: <b843add28929906067c4343fad67b096@redhat.com>
References: <425D0E2C.20102@birger.sh>
	<b843add28929906067c4343fad67b096@redhat.com>
Message-ID: <425E0023.2030609@uib.no>

Jonathan E Brassow wrote:

>
> I shouldn't even be compiling this from the top level.  It's not ready 
> and there needs to be accompanying device-mapper changes.

That explains a lot :-D

>
> Please forgive.  I think cmirror is the last thing to compile, so if 
> you ignore the error the rest should have installed fine.  If you 
> don't like the errors, you can comment out cmirror in the makefile - 
> which is what I'm going to do right now.

I did, and did a new make (that didn't have anything to do) and a make 
install.
Thanks for your answer. It certainly solved my problems.

-- 
birger



From birger at uib.no  Thu Apr 14 05:52:30 2005
From: birger at uib.no (Birger Wathne)
Date: Thu, 14 Apr 2005 07:52:30 +0200
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
Message-ID: <425E051E.2050208@uib.no>

I have just installed the cluster package, and I am now looking for some 
help on how to use it :-)

I have a lot of experience with Veritas FirstWatch and some with 
SunCluster, so I am not new to HA services. Now I have this server that 
I have to get up and running as quickly as possible... And I find little 
documentation about how to get this software up and running from a fresh 
install.

I have one old file server with external scsi disks, and one new server 
with a Nexsan ATAboy RAID array.
I want to set up the new file server as half of a 2-node cluster and get 
it into production. Then move over data (and disks) from the old server 
until I can reinstall that one as the second cluster node.

I first thought along the lines I am used to from Solaris and 
clustering. I wanted to set up 2 services: 1 NFS service that would take 
the disk with it when it moved, and 1 samba service that NFS-mounted the 
disk from the nfs service.

After looking at the redhat stuff I am thinking:
- Mount the disks permanently on both nodes using gfs (less chance of 
nuking the file systems because of a split-brain)
- Perhaps also run NFS services permanently on both nodes, failing over 
only the IP address of the official NFS service. Should make failover 
even faster, but are there pitfalls to running multiple NFS servers off 
the same gfs file system? In addition to failing over the IP address, I 
would have to look into how to take along NFS file locks when doing a 
takeover.

Can anyone 'talk me through' the steps needed to get this up and running?
I have tried to create /etc/cluster/cluster.conf, but ccsd fails with
Failed to connect to cluster manager.
Hint: Magma plugins are not in the right spot.


-- 
birger



From fabbione at fabbione.net  Thu Apr 14 14:44:44 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Thu, 14 Apr 2005 16:44:44 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] Fix cman-kernel build with 2.6.12rc2
Message-ID: <20050414144444.6ABE12A8C@trider-g7.fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everybody,

the following patch fixes compilation of cnxman.c with 2.6.12rc2
replacing sk_zapped with its new substitute.

Please apply.

Signed-off-by: Fabio Massimo Di Nitto <fabbione at ubuntu.com>

Index: cman-kernel/src/cnxman.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v
retrieving revision 1.55
diff -u -r1.55 cnxman.c
- --- cman-kernel/src/cnxman.c	5 Apr 2005 13:43:09 -0000	1.55
+++ cman-kernel/src/cnxman.c	14 Apr 2005 14:27:03 -0000
@@ -1065,7 +1065,7 @@
 	if (!capable(CAP_NET_BIND_SERVICE))
 		return -EPERM;
 
- -	if (sk->sk_zapped == 0)
+	if (sock_flag(sk, SOCK_ZAPPED) == 0)
 		return -EINVAL;
 
 	if (addr_len != sizeof (struct sockaddr_cl))
@@ -1089,7 +1089,7 @@
 	up(&port_array_lock);
 
 	c->port = saddr->scl_port;
- -	sk->sk_zapped = 0;
+	sock_reset_flag(sk, SOCK_ZAPPED);
 
 	/* If we are not a cluster member yet then make the client wait until
 	 * we are, this allows nodes to start cluster clients at the same time
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQl6BOlA6oBJjVJ+OAQIUPBAAtPSWMm5k9JvH+fMt+H193fJfgfEWa7EL
jfwC2GP/fSAz7RmhXvChdqkrxSBt7VS5oEKwwG6kt7tfJZIvxIACaEzaOOXLidHh
UyOWJHUj0twQZQYsBEr8nN7lcfC2+jCiYSgbCoUzD/+36q/QOFoLeeAOa0am+c7t
lc4QRmBt1tfGPtf7dTVgNpwX/hZPSlLXLDaBkUBztvncTTZNmmQ15jmjufXcdj31
lGtYzc8we5540YtbWKmFG0/M6AOS0BCOBT+3vLxtwbdXIeO2c+1BOhFr3cWPYgAg
01q/TlMN504P7KyhOm7G/an/exrbzDDKflgzAadEuzDoFAFnDZG0FCdpaz/Fvd3j
YkJFPqMJuX0DuhjiHJlwwvzkcjO32RWBAs06lYlCVjyK6mf4GasUF3dJPAIIOhWh
ZHK7c0+dWyUA8GINjdJaCkKn3Yz/zmFxFLSUEZsl63A4AXlP7cc+Nz3+0VfmXbVI
c97hN/xd3dS/v4LZyE76kHxjTRCyDCKzszF/9iW+0O9mOnSc/FxLfgm8dl86xZVH
fpj2fx/8IWDWMXLANANVigXDxjJWZjSyDCZOutbY1Q/0/mUIg/CLZq18HCKjEOKu
LmK0V7ayRPop1HFuNFueWA7nOTrKYNjkuZia2uxIE4s+5i7BsiyTgWjxYZHP3LtO
9JefecsNpwc=
=QzsY
-----END PGP SIGNATURE-----



From fabbione at fabbione.net  Thu Apr 14 14:45:02 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Thu, 14 Apr 2005 16:45:02 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] Fix dlm-kernel build with 2.6.12rc2
Message-ID: <20050414144502.C75AF2A8C@trider-g7.fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everybody,

the following patch fixes compilation of nodes.c with 2.6.12rc2.

A macro called nodes_clear has been recently introduced. This leads
to a clash. I renamed the DLM one to nodes_nodes_clear only to solve
the problem, but of course my patch isn't authoritative. Feel
free to rename it as you wish :)

Signed-off-by: Fabio Massimo Di Nitto <fabbione at ubuntu.com>

Index: dlm-kernel/src/nodes.c
===================================================================
RCS file: /cvs/cluster/cluster/dlm-kernel/src/nodes.c,v
retrieving revision 1.12
diff -u -r1.12 nodes.c
- --- dlm-kernel/src/nodes.c	27 Jan 2005 09:23:45 -0000	1.12
+++ dlm-kernel/src/nodes.c	14 Apr 2005 14:28:08 -0000
@@ -277,7 +277,7 @@
 	return error;
 }
 
- -static void nodes_clear(struct list_head *head)
+static void nodes_nodes_clear(struct list_head *head)
 {
 	struct dlm_csb *csb;
 
@@ -290,13 +290,13 @@
 
 void ls_nodes_clear(struct dlm_ls *ls)
 {
- -	nodes_clear(&ls->ls_nodes);
+	nodes_nodes_clear(&ls->ls_nodes);
 	ls->ls_num_nodes = 0;
 }
 
 void ls_nodes_gone_clear(struct dlm_ls *ls)
 {
- -	nodes_clear(&ls->ls_nodes_gone);
+	nodes_nodes_clear(&ls->ls_nodes_gone);
 }
 
 int ls_nodes_init(struct dlm_ls *ls, struct dlm_recover *rv)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQl6BQlA6oBJjVJ+OAQKQYg/+OsUuKjv3gS3T0+c30QNP1hWGBY/QP250
q3stoxo1Nt8NSRmDI2CXuHpCMah5XVPmyf13nMDj2m60VrGi51Jyrt9PDbYhYbXA
W1weCsZKf7aD1SVnJ6Dauebj3eU/PPU53n1M8+6RrRqkVLnXW+MqIcEt/SAKO55V
uovdTYrZrXJYdR6uU49Ss1qdTGQP6QbLzf8ivXPzl44GlWHhYMSc4WvV3RxRslxI
n/BUhCL2MJdhX26r8w5dgL2VG5bnz3x8S8T37Eh8Bs39VAwuleW5Cfpr4MV4Yhuw
WTrkIF56BFSiqDvvLuvhVVZlAFKvqeU6Xp9kVMUMAOJ7tjTwwwbX+TTUL/YAi7Yp
UqaMk9yAhgiVsBK+/0AqHJYx1mGfZhgbQ1A3Wr0uIADDsoHI4OP3ZsaYnKlX04rm
1JoxF1I4nHg0hlgGJHLCaGTTIRuVKzIvutFFZL9oWj6fp3mxw6MYo7nWxPxjMAhT
hCfo8YjoGlpUVlnBgOSRCVUNaCXwcxNWgRyoVfn8yX+vufTQZxUCzX5SkiFcMDn1
uP8sUFvNoSvNYhzutd4Ma5pb2I+Qu4zVxKFRQ7rBSOBFn9UI4klS5Eu9JlMAzd5f
fmDK4lXDM/9yjjqSbQTNAV2gSkrwwtxfu2DSGXja/Xkh5MtPdv2OEbXGLLkTiZ8g
u/FUdUCytd8=
=pKJ1
-----END PGP SIGNATURE-----



From fabbione at fabbione.net  Thu Apr 14 14:45:19 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Thu, 14 Apr 2005 16:45:19 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] Fix gnbd-kernel build with 2.6.12rc2
Message-ID: <20050414144519.546922A8C@trider-g7.fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everybody,

the following patch fixes compilation of gnbd.c with 2.6.12rc2.

The i_sock has been recently removed from the inode structure (change
happened in the kernel tree the 1st of April) and made part of i_mode.

Please apply.

Signed-off-by: Fabio Massimo Di Nitto <fabbione at ubuntu.com>

Index: gnbd-kernel/src/gnbd.c
===================================================================
RCS file: /cvs/cluster/cluster/gnbd-kernel/src/gnbd.c,v
retrieving revision 1.7
diff -u -r1.7 gnbd.c
- --- gnbd-kernel/src/gnbd.c	7 Apr 2005 16:19:37 -0000	1.7
+++ gnbd-kernel/src/gnbd.c	14 Apr 2005 14:30:29 -0000
@@ -735,7 +735,7 @@
 		if (!file)
 			return error;
 		inode = file->f_dentry->d_inode;
- -		if (!inode->i_sock) {
+		if (!S_ISSOCK(inode->i_mode)) {
 			fput(file);
 			return error;
 		}
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQl6BTVA6oBJjVJ+OAQIU7BAAnea52QS9ISXWHXWrrqeEaqFVbm1bSs1A
+BKMycDiSsDKwttb+/bma2V56gjdqnv7//11wv2IiG5lt1q1HebgVTM+ecPMCRBb
6VsJV2NB+HgjRtcNkbAiw7hLVpcG+WFe5VaFVSsG20B5I47n9ahkF0a8umY4zSbd
O1pCBJA3H4QMiTwlNA8kEj5EBdc3/jB4KCYGwGNhR7m61etZ4JMiEdGlOeQwYMK1
4DcXpCgo8aBLACUHGST2e3mnq48ztHHMNI7M0H8BLNrUbhm1EtIEtzyXqJjrS7ku
TNZKKyfjlioAJk4B718ValMMEifZtlxwjlT3FEYfEd7/MUA2sw6ET4arFbDKcGjU
Bn5wdFdoVDZpDwhWICfQq2rVleBydNGCyZ4HYMcI3WBi3RKH21zrLnt5YqL9EA/9
9TC8PhD24i8+9rp/kmRV3QtWJtooEO2VSfGKJSDXHoeKkt8S2RTByxuBo5UpBMkI
z/+lB8zlDyF+qvn3TtkaTuJC8fk3clrkQfT+jiI4/7ZztK37NgcCF9Qe1rac3QS4
VFRTrYJD8hcAOMa40HHCdZTyezetE4N/m6SDOJ+Pps+2KTWYxkJguas0+Aua5yeP
jyyAV3vmKMmPewbNknw1gHoPTI4pz1QUZ89E3hhnmM1Zoi6y4CMzq1ndv/ZqAROx
cS4j9lsnd60=
=+YaG
-----END PGP SIGNATURE-----



From CAugustine at overlandstorage.com  Thu Apr 14 18:22:15 2005
From: CAugustine at overlandstorage.com (CAugustine at overlandstorage.com)
Date: Thu, 14 Apr 2005 11:22:15 -0700
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 12, Issue 9
Message-ID: <OF58196766.16914E59-ON88256FE3.006497A2-88256FE3.0064EA66@myoverland.net>

Hi Everyone,

I have tried to down load the cluster software by running:

        cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs 
 
and providing the password=cvs. Unfortunately, the connection to 
sources.redhat.com (12.
107.209.250):2401 faileds with a "connection time out" messages. I am not 
sure if there is
a problem at redhat or locally at my site...

Any suggestions?

Thanks,
Caroline

----------------------------------------------------------------------------------------------
See our award-winning line of tape and disk-based
backup & recovery solutions at http://www.overlandstorage.com
----------------------------------------------------------------------------------------------

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050414/42255e2d/attachment.htm>

From rstevens at vitalstream.com  Thu Apr 14 19:26:30 2005
From: rstevens at vitalstream.com (Rick Stevens)
Date: Thu, 14 Apr 2005 12:26:30 -0700
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 12, Issue 9
In-Reply-To: <OF58196766.16914E59-ON88256FE3.006497A2-88256FE3.0064EA66@myoverland.net>
References: <OF58196766.16914E59-ON88256FE3.006497A2-88256FE3.0064EA66@myoverland.net>
Message-ID: <425EC3E6.8040608@vitalstream.com>

CAugustine at overlandstorage.com wrote:
> 
> Hi Everyone,
> 
> I have tried to down load the cluster software by running:
> 
>         cvs -d :pserver:cvs at sources.redhat.com:/cvs/cluster login cvs  
>    
> and providing the password=cvs. Unfortunately, the connection to 
> sources.redhat.com (12.
> 107.209.250):2401 faileds with a "connection time out" messages. I am 
> not sure if there is
> a problem at redhat or locally at my site...

Your firewall probably blocks TCP/UDP port 2401.  CVS :pserver:
operations use that port.  Poke a hole in your firewall to allow
incoming data on both TCP port 2401 and UDP port 2401.
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer     rstevens at vitalstream.com -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
- If at first you don't succeed, quit. No sense being a damned fool! -
----------------------------------------------------------------------



From pcaulfie at redhat.com  Fri Apr 15 08:08:59 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 15 Apr 2005 09:08:59 +0100
Subject: [Linux-cluster] [PATCH] Fix dlm-kernel build with 2.6.12rc2
In-Reply-To: <20050414144502.C75AF2A8C@trider-g7.fabbione.net>
References: <20050414144502.C75AF2A8C@trider-g7.fabbione.net>
Message-ID: <20050415080859.GA23730@tykepenguin.com>

On Thu, Apr 14, 2005 at 04:45:02PM +0200, Fabio Massimo Di Nitto wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi everybody,
> 
> the following patch fixes compilation of nodes.c with 2.6.12rc2.

Thanks for those. I'll apply them to CVS head when 2.6.12 is released.
-- 

patrick



From fabbione at fabbione.net  Fri Apr 15 10:52:44 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Fri, 15 Apr 2005 12:52:44 +0200
Subject: [Linux-cluster] [PATCH] Fix dlm-kernel build with 2.6.12rc2
In-Reply-To: <20050415080859.GA23730@tykepenguin.com>
References: <20050414144502.C75AF2A8C@trider-g7.fabbione.net>
	<20050415080859.GA23730@tykepenguin.com>
Message-ID: <425F9CFC.1000901@fabbione.net>

Patrick Caulfield wrote:
> On Thu, Apr 14, 2005 at 04:45:02PM +0200, Fabio Massimo Di Nitto wrote:
> 
>>-----BEGIN PGP SIGNED MESSAGE-----
>>Hash: SHA1
>>
>>Hi everybody,
>>
>>the following patch fixes compilation of nodes.c with 2.6.12rc2.
> 
> 
> Thanks for those. I'll apply them to CVS head when 2.6.12 is released.

Welcome :)

Fabio



From mrc at linuxplatform.org  Fri Apr 15 13:19:48 2005
From: mrc at linuxplatform.org (Matt)
Date: Fri, 15 Apr 2005 09:19:48 -0400
Subject: [Linux-cluster] DB Clustering Question
Message-ID: <1113571189.6839.6.camel@althea.playway.net>

Hi everyone, I'm new to this list.  I'm researching database cluster
solutions and I'm not really finding what I'm looking for.

What I really want to do is parallel processing with mySQL or
Postgresql.  If I can't do that, then simply having multiple SQL servers
share the same DB files is the next option.  Can anyone push me in the
right direction?

One last question, does anyone have any experience with the Ingres
database and its clustering features?

-- 
Matt <mrc at linuxplatform.org>



From gwood at dragonhold.org  Fri Apr 15 13:33:48 2005
From: gwood at dragonhold.org (gwood at dragonhold.org)
Date: Fri, 15 Apr 2005 14:33:48 +0100 (BST)
Subject: [Linux-cluster] DB Clustering Question
In-Reply-To: <1113571189.6839.6.camel@althea.playway.net>
References: <1113571189.6839.6.camel@althea.playway.net>
Message-ID: <18349.198.96.134.61.1113572028.squirrel@198.96.134.61>

> What I really want to do is parallel processing with mySQL or
> Postgresql.
This needs support at the DB level.  MySQL has a version that requires the
DB to be smaller than the amount of available memory - since the DB gets
kept in RAM of all the clustered servers (not sure if it does more than 2,
I can't afford that much memory *grin*).  Not sure about postgresql,
you'll have to check their website for cluster options.  If you're willing
to spend money on applications, there is a 3rd party addon for MySQL that
does it, but it's been a while since I looked at it, so I don't know the
details.  From memory, it intercepts queries at the TCP/IP stack layer -
all the machines have the same MAC for a virtual server as well as sharing
state data (even to the TCP/IP layer), and therefore can take over any
existing connections for running servers.  It does mean that you have all
the machines handling all the incoming network traffic which is less than
ideal.

> If I can't do that, then simply having multiple SQL servers
> share the same DB files is the next option.  Can anyone push me in the
> right direction?
This won't work, at least not without application layer support.  In the
same way that you need GFS to get multiple machines to use the same
filesystem, you'd need a similar level of support for locking & caching
within the database.

I think that Oracle have it (for their RAC product) and probably some
others too, but I don't know of anything similar in MySQL at least.

If your usage is skewed to reads rather than writes, then you could
probably do something with replication, but there are details on that on
the various websites too.

Hope this helps some,

Graham



From chrisd at pearsoncmg.com  Fri Apr 15 14:19:42 2005
From: chrisd at pearsoncmg.com (Chris Darroch)
Date: Fri, 15 Apr 2005 10:19:42 -0400
Subject: [Linux-cluster] DB Clustering Question
In-Reply-To: <1113571189.6839.6.camel@althea.playway.net>
References: <1113571189.6839.6.camel@althea.playway.net>
Message-ID: <425FCD7E.3000705@pearsoncmg.com>

Matt wrote:

> What I really want to do is parallel processing with mySQL or
> Postgresql.  If I can't do that, then simply having multiple SQL servers
> share the same DB files is the next option.  Can anyone push me in the
> right direction?

   I'm new to the list as well, but having just gone through the process
of evaluating exactly this kind of problem, I have a few cents I can
throw in.

   I think the very short answer to your question is that databases and
multiple servers don't mix well at all, as a general rule, and if you
need full transactional SQL support in a cluster, you're likely looking
at a commerical solution.

   The fundamental problem is that transactional databases, of which
SQL databases are a subset, need to ensure that all transactions occur
atomically, and to do this, they need very robust, very fast locking
subsystems.  For example, before updating a row in a table, a
database process needs to be sure that it acquires a lock on that data
first, so that other database processes handling other client requests
don't read partially altered data.

   Now locking is hard enough to do when you have just one machine
(either single CPU or multiple CPUs), but can be done quite effectively
and efficiently through the use of in-memory mutexes and other such
devices.  Oracle, for example, takes out a big chunk of shared memory,
which all processes use to coordinate locking.

   Doing this in a cluster of machines is much, much more difficult.
It's compounded by the problem that one or more machines could fail,
or the network could fail in various ways, and the DB software must
ensure that under no conditions does the data become corrupted.
(See all the work involved in the GFS DLM, for example, involving
handling "split brain" conditions and the such like.)

   Oracle RAC (Real Application Cluster) provides this functionality
at considerable expense, for instance, by requiring that you have
a high-speed interconnection network between your machines, and
then by providing its own internal lock manager and cluster monitor
and so forth.  Essentialy, many of the components of GFS are
provided inside Oracle RAC, for its own purposes, but are unavailable
to outside processes.  You can also run Oracle RAC on Linux, in
various ways:

http://www.redhat.com/software/rha/gfs/
http://www.oracle.com/technology/tech/linux/index.html
http://www.veritas.com/van/articles/7655.jsp

If I understand the RedHat option correctly, Oracle relies on GFS
to manage the shared storage in the cluster, but still uses its own
lock manager, cluster monitor, etc., for its own internal cache
management and transaction handling.  However, I haven't read the
installation white paper, so I'm not sure about that.  (Note to
RedHat folks: trying to register on the Web site leads to an
access denied error for the /info/ page.)

   Open source SQL databases like PostgreSQL and MySQL just don't
have this kind of feature, so far as I can determine.  MySQL provides a
cluster mechanism over regular TCP, but as far as I could tell from the
documentation, this works by keeping the entire database in RAM on each
cluster node:

http://dev.mysql.com/doc/mysql/en/multi-hardware-software-network.html

   PostgreSQL can be run in a cluster by emulating a single operating
system underneath it, using high-speed interconnections and special
kernel modifications:

http://www.linuxlabs.com/clusgres.html

   I don't know much about Ingres myself, but I didn't see anything
about clustering for that, either.

   It's perhaps worth noting that PostgreSQL and Oracle face special
complexities regarding data consistency and locking because they
provide MVCC (Multi-Version Concurrency Control), which means that
each database client sees a "snapshot" of the entire database
as it was when they began their transaction.  As long as their
transaction remains active, the database retains previous versions
of all data modified by all other active transactions, so that
the snapshot remains accurate to a past point in time.  Only once
the transaction has closed can the database clean up old versions of
data.  This is subtly different from just providing row-level locking
in a table; if one transaction is slowing reading through all the
rows of a table while another one performs updates of selected rows,
the old versions of the updated rows are kept around until the
reader's transaction closes, in case they are needed to provide an
accurate view of what the data in the table looked like when the
reader's transaction began.  So that's all just to say that the
business of locking and shuffling data around is especially complex
for such databases, and doing it in a cluster even more so.

   What you are able to do with the available options depends partly
on your requirements, obviously.  If you don't mind having multiple
read-only copies of your database files, and allowing them to be
somewhat out of date, there are various ways you could replicate
your data files from a master read-write node to multiple read-only
nodes.  You'd want to ensure that the copying process performed
the necessary interactions with the master database to ensure that
it never copied partially complete data files; performing a hot
backup and then replicating those files to the read-only nodes would
work.

   Another related option if you don't mind having read-only and
slightly out-of-date copies is to use memcached:

http://www.danga.com/memcached/

This functions as a data cache between your client programs and
the database, and spreads the data around to multiple machines.
But obviously write requests need to go to the master database,
and then be replicated to the caches, and there's a period of time
when you might not read up-to-date data from the cache.  But this
may be OK for your application.

   If you need true full transactional SQL support spread across a
cluster, I believe you'll have to look at Oracle or another commerical
solution like the ClusGres one I referenced above.  I'd love to
stand correctly, though, if anyone knows more about this.

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B



From Hansjoerg.Maurer at dlr.de  Fri Apr 15 14:24:21 2005
From: Hansjoerg.Maurer at dlr.de (Hansjoerg.Maurer at dlr.de)
Date: Fri, 15 Apr 2005 16:24:21 +0200
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
Message-ID: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>

Hi

I found a solution for the problem descriped below,
but I am not sure if it is the right way.

- importing the two gnbd's (wich point to the same device) from two servers
-> /dev/gnbd0 and /dev/gnbd1 on the client

- creating a multipath device with something like this:
echo "0 167772160  multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0
 (251:0 ist the major:minor id of /dev/gnbd0)

- mounting the created device
eg:
mount -t gfs /dev/mapper/dm0 /mnt/lvol0

If I do a write on /mnt/lvol0 the gnbd_server task on both gnbd_servers start (with a noticeable speedup)

If one gnbd_server fails dm removes that path with the following log
kernel: device-mapper: dm-multipath: Failing path 251:0.

I was able to add it again with

dmsetup  message dm0 0  reinstate_path 251:0


I was able to deactivate a path manually with

dmsetup  message dm0 0  fail_path 251:0 

But I can not unimport the underlying gnbd

gnbd_import: ERROR cannot disconnect device #1 : Device or resource busy


Is there a way to remove a gnbd, which is bunndled in a dm-multipath device?
(might be necessary, if one gnbd server must be rebooted)

How can I reimport an gnbd on the client in state disconnected?
(I had to manually start 
gnbd_recvd -d 0  to do so)

Is the descriped solution for gnbd multipath the right one?

Thank you very much

Greetings from munich

Hansj?rg






>Hi
>
>I am trying to set up gnbd with multipath.
>Accoding to the gnbd_usage.txt file, I understand, that this should work with
>dm-multipath.
>But unfortunatly only the gfs part of the setup is descriped there.
>
>Has anybody experiance with this setup, especially how to set up
>multipath with multiple /dev/gnbd* and how to setup the multipath.conf file
>
>
>Thank you very much
>
>Hansj?rg Maurer
-- 
_________________________________________________________________

Dr.  Hansjoerg Maurer           | LAN- & System-Manager
                                |
Deutsches Zentrum               | DLR Oberpfaffenhofen
  f. Luft- und Raumfahrt e.V.   |
Institut f. Robotik             |
Postfach 1116                   | Muenchner Strasse 20
82230 Wessling                  | 82234 Wessling
Germany                         |
                                |
Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
__________________________________________________________________


There are 10 types of people in this world, 
those who understand binary and those who don't.






From teigland at redhat.com  Fri Apr 15 14:39:53 2005
From: teigland at redhat.com (David Teigland)
Date: Fri, 15 Apr 2005 22:39:53 +0800
Subject: [Linux-cluster] DB Clustering Question
In-Reply-To: <1113571189.6839.6.camel@althea.playway.net>
References: <1113571189.6839.6.camel@althea.playway.net>
Message-ID: <20050415143953.GB11756@redhat.com>

On Fri, Apr 15, 2005 at 09:19:48AM -0400, Matt wrote:
> What I really want to do is parallel processing with mySQL or
> Postgresql.  If I can't do that, then simply having multiple SQL servers
> share the same DB files is the next option.  Can anyone push me in the
> right direction?
> 
> One last question, does anyone have any experience with the Ingres
> database and its clustering features?

I believe Ingres is the only cluster database that's open source.  When I
looked a few months ago there was some work needed to hook it into our
cluster/lock managers, but that didn't look too bad as they were already
able to switch between different clustering/locking infrastrutures.

-- 
Dave Teigland  <teigland at redhat.com>



From lhh at redhat.com  Fri Apr 15 14:59:27 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 15 Apr 2005 10:59:27 -0400
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
In-Reply-To: <425E2180.6060609@birger.sh>
References: <425E2180.6060609@birger.sh>
Message-ID: <1113577167.20618.155.camel@ayanami.boston.redhat.com>

On Thu, 2005-04-14 at 09:53 +0200, birger wrote:

> - Mount the disks permanently on both nodes using gfs (less chance of
> nuking the file systems because of a split-brain)

The way GFS forcefully prevents I/O (which protects data!) is via
"fencing" (fibre channel zoning or a remote power controller/integrated
power control, etc).  This prevents the block I/Os from hitting the
disks for a node which has died, and works with any file system (not
just GFS).

Fencing is required in order for CMAN to operate in any useful capacity
in 2-node mode.

Anyway, to make this short: You probably want fencing for your solution.

> - Perhaps also run NFS services permanently on both nodes, failing over
> only the IP address of the official NFS service. Should make failover
> even faster, but are there pitfalls to running multiple NFS servers off
> the same gfs file system? In addition to failing over the IP address, I
> would have to look into how to take along NFS file locks when doing a
> takeover.

With GFS, the file locking should just kind of "work", but the client
would be required to fail over.  I don't think the Linux NFS client can
do this, but I believe the Solaris one can... (correct me if I'm wrong
here).

Failing over just an IP may work, but there may be some issues as well.
In any case, we should certainly *make* it work if it doesn't at the
moment, eh? :)

With a pure NFS failover solution (ex: on ext3, w/o replicated cluster
locks), there needs to be some changes to nfsd, lockd, and rpc.statd in
order to make lock failover work seamlessly.


> Can anyone 'talk me through' the steps needed to get this up and running?

Well, there's a start of the issues.

You can use rgmanager to do the IP and Samba failover.  Take a look at
"rgmanager/src/daemons/tests/*.conf".  I don't know how well Samba
failover has been tested.

-- Lon



From chrisd at pearsoncmg.com  Fri Apr 15 16:31:01 2005
From: chrisd at pearsoncmg.com (Chris Darroch)
Date: Fri, 15 Apr 2005 12:31:01 -0400
Subject: [Linux-cluster] DB Clustering Question
In-Reply-To: <20050415143953.GB11756@redhat.com>
References: <1113571189.6839.6.camel@althea.playway.net>
	<20050415143953.GB11756@redhat.com>
Message-ID: <425FEC45.8060302@pearsoncmg.com>

Hi --

David Teigland wrote:

> I believe Ingres is the only cluster database that's open source.  When I
> looked a few months ago there was some work needed to hook it into our
> cluster/lock managers, but that didn't look too bad as they were already
> able to switch between different clustering/locking infrastrutures.

   That's very interesting -- I should have looked more closely
at Ingres R3!  :-)  Naturally, their site is mostly down now that
I want to look at it.  Seems like they adopted OpenDLM last year.

   I can't quite tell, but if the Ingres Grid Option is their "single
DB clustering" option, it seems to not support things like row-level
locks and "update mode locks".  (The Distributed Option appears to be
a DTP solution for heterogeneous DBs, and the Replicator Option one
for replicating between Ingres DBs, both based on two-phase commits.
It looks like you turn of two-phase commits when using the Replicator
and Grid Options together.)  Errors are mine due to overly quick
scanning of documents.


I wrote:

>    It's perhaps worth noting that PostgreSQL and Oracle face special
> complexities regarding data consistency and locking because they
> provide MVCC (Multi-Version Concurrency Control) ...

   One small pointless correction to my own tangent is that Oracle
actually calls their version MVRC and it works a little differently
than PostgreSQL's, but the rough idea is the same.

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B



From fedora at nodata.co.uk  Fri Apr 15 16:46:09 2005
From: fedora at nodata.co.uk (nodata)
Date: Fri, 15 Apr 2005 18:46:09 +0200
Subject: [Linux-cluster] DB Clustering Question
In-Reply-To: <425FCD7E.3000705@pearsoncmg.com>
References: <1113571189.6839.6.camel@althea.playway.net>
	<425FCD7E.3000705@pearsoncmg.com>
Message-ID: <1113583569.3224.13.camel@sb-home.lan>

On Fri, 2005-04-15 at 10:19 -0400, Chris Darroch wrote:
> Matt wrote:
> 
> > What I really want to do is parallel processing with mySQL or
> > Postgresql.  If I can't do that, then simply having multiple SQL servers
> > share the same DB files is the next option.  Can anyone push me in the
> > right direction?
> 
>    I'm new to the list as well, but having just gone through the process
> of evaluating exactly this kind of problem, I have a few cents I can
> throw in.
> 
>    I think the very short answer to your question is that databases and
> multiple servers don't mix well at all, as a general rule, and if you
> need full transactional SQL support in a cluster, you're likely looking
> at a commerical solution.
> 
>    The fundamental problem is that transactional databases, of which
> SQL databases are a subset, need to ensure that all transactions occur
> atomically, and to do this, they need very robust, very fast locking
> subsystems.  For example, before updating a row in a table, a
> database process needs to be sure that it acquires a lock on that data
> first, so that other database processes handling other client requests
> don't read partially altered data.
> 
>    Now locking is hard enough to do when you have just one machine
> (either single CPU or multiple CPUs), but can be done quite effectively
> and efficiently through the use of in-memory mutexes and other such
> devices.  Oracle, for example, takes out a big chunk of shared memory,
> which all processes use to coordinate locking.
> 
>    Doing this in a cluster of machines is much, much more difficult.
> It's compounded by the problem that one or more machines could fail,
> or the network could fail in various ways, and the DB software must
> ensure that under no conditions does the data become corrupted.
> (See all the work involved in the GFS DLM, for example, involving
> handling "split brain" conditions and the such like.)
> 
>    Oracle RAC (Real Application Cluster) provides this functionality
> at considerable expense, for instance, by requiring that you have
> a high-speed interconnection network between your machines, and
> then by providing its own internal lock manager and cluster monitor
> and so forth.  Essentialy, many of the components of GFS are
> provided inside Oracle RAC, for its own purposes, but are unavailable
> to outside processes.  You can also run Oracle RAC on Linux, in
> various ways:
> 
> http://www.redhat.com/software/rha/gfs/
> http://www.oracle.com/technology/tech/linux/index.html
> http://www.veritas.com/van/articles/7655.jsp
> 
> If I understand the RedHat option correctly, Oracle relies on GFS
> to manage the shared storage in the cluster, but still uses its own
> lock manager, cluster monitor, etc., for its own internal cache
> management and transaction handling.  However, I haven't read the
> installation white paper, so I'm not sure about that.  (Note to
> RedHat folks: trying to register on the Web site leads to an
> access denied error for the /info/ page.)
> 
>    Open source SQL databases like PostgreSQL and MySQL just don't
> have this kind of feature, so far as I can determine.  MySQL provides a
> cluster mechanism over regular TCP, but as far as I could tell from the
> documentation, this works by keeping the entire database in RAM on each
> cluster node:
> 
> http://dev.mysql.com/doc/mysql/en/multi-hardware-software-network.html
> 
>    PostgreSQL can be run in a cluster by emulating a single operating
> system underneath it, using high-speed interconnections and special
> kernel modifications:
> 
> http://www.linuxlabs.com/clusgres.html
> 
>    I don't know much about Ingres myself, but I didn't see anything
> about clustering for that, either.
> 
>    It's perhaps worth noting that PostgreSQL and Oracle face special
> complexities regarding data consistency and locking because they
> provide MVCC (Multi-Version Concurrency Control), which means that
> each database client sees a "snapshot" of the entire database
> as it was when they began their transaction.  As long as their
> transaction remains active, the database retains previous versions
> of all data modified by all other active transactions, so that
> the snapshot remains accurate to a past point in time.  Only once
> the transaction has closed can the database clean up old versions of
> data.  This is subtly different from just providing row-level locking
> in a table; if one transaction is slowing reading through all the
> rows of a table while another one performs updates of selected rows,
> the old versions of the updated rows are kept around until the
> reader's transaction closes, in case they are needed to provide an
> accurate view of what the data in the table looked like when the
> reader's transaction began.  So that's all just to say that the
> business of locking and shuffling data around is especially complex
> for such databases, and doing it in a cluster even more so.
> 
>    What you are able to do with the available options depends partly
> on your requirements, obviously.  If you don't mind having multiple
> read-only copies of your database files, and allowing them to be
> somewhat out of date, there are various ways you could replicate
> your data files from a master read-write node to multiple read-only
> nodes.  You'd want to ensure that the copying process performed
> the necessary interactions with the master database to ensure that
> it never copied partially complete data files; performing a hot
> backup and then replicating those files to the read-only nodes would
> work.
> 
>    Another related option if you don't mind having read-only and
> slightly out-of-date copies is to use memcached:
> 
> http://www.danga.com/memcached/
> 
> This functions as a data cache between your client programs and
> the database, and spreads the data around to multiple machines.
> But obviously write requests need to go to the master database,
> and then be replicated to the caches, and there's a period of time
> when you might not read up-to-date data from the cache.  But this
> may be OK for your application.
> 
>    If you need true full transactional SQL support spread across a
> cluster, I believe you'll have to look at Oracle or another commerical
> solution like the ClusGres one I referenced above.  I'd love to
> stand correctly, though, if anyone knows more about this.
> 
> Chris.
> 

If the database is mainly used for reads, you should check out emic
networks' product. It will allow you to cluster mysql across multiple
boxes, and if a node fails, it doesn't matter. If you want to add more
boxes you can too. It's load balanced, and writes are atomic across the
cluster. Interestingly, it does NOT require an in-RAM database.

See http://www.emicnetworks.com/



From srinisan at fmailbox.com  Fri Apr 15 17:24:44 2005
From: srinisan at fmailbox.com (Srini Sankaran)
Date: Fri, 15 Apr 2005 10:24:44 -0700
Subject: [Linux-cluster] Can LOCK_NOLOCK be used in this situation?
Message-ID: <37c450ffae096dcee27edda8e07e5d7a@fmailbox.com>

I don't have a GFS cluster running right now. I'd appreciate some 
guidance on using GFS for the following situation:

I need a cluster of nodes to have read and write access to a scalable 
and common pool of storage connected to an FC SAN. The entire pool of 
storage must appear as one single file system to the cluster nodes. So 
far so good. GFS fits. But...

The application running on each node is partitioned in such a way that 
at any given moment, only one node will need read / write access to a 
directory and its descendant file tree.

For example, let's say the file system is called "/big" and it has 
directories "a", "b", ... "z". Let's say that I have cluster node "1", 
"2", and "3". When node 1 needs access to "/big/a", the other nodes "2" 
and "3", won't need access to "/big/a". Those nodes will be reading and 
writing in to "/big/b" or "/big/c" or something else. In general, 
"/big/a" and other directories could have several million files.

A few minutes to hours later, node 2 might take over the read / write 
responsibilities for "/big/a", and node 1 might move over to "/big/b", 
etc.

 From reading the GFS documentation, it certainly appears that a 
standard GFS with locking (single or redundant servers) would work in 
this situation. But, I would like to avoid designating any single or 
multiple servers as lock servers. This is because the cluster is very 
dynamic. Nodes can constantly be added or removed, and the system 
administration environment isn't conducive for designating lock servers 
and protecting them. Besides, I am wondering why the lock servers 
should work so hard to maintain all the locks on the millions of files 
when I know 100% that no other node is going to access the files 
simultaneously.

So, my question is: Can I simply use LOCK_NOLOCK in this situation and 
avoid any lock server? Maybe the answer is no because the documentation 
warns "Do not allow multiple nodes to mount the same file system while 
LOCK_NOLOCK is used. Doing so causes one or more nodes to panic their 
kernels, and may cause file system corruption".

I am still asking the question because of this partitioned file system 
access characteristic of my application. Is this warning still valid if 
I can guarantee that no two files or directories will be accessed by 
two different nodes simultaneously? If I can't do LOCK_NOLOCK, is there 
any other idea I can use here?

Thanks for your time



From kpreslan at redhat.com  Fri Apr 15 18:02:10 2005
From: kpreslan at redhat.com (Ken Preslan)
Date: Fri, 15 Apr 2005 13:02:10 -0500
Subject: [Linux-cluster] Can LOCK_NOLOCK be used in this situation?
In-Reply-To: <37c450ffae096dcee27edda8e07e5d7a@fmailbox.com>
References: <37c450ffae096dcee27edda8e07e5d7a@fmailbox.com>
Message-ID: <20050415180210.GA19976@potassium.msp.redhat.com>

On Fri, Apr 15, 2005 at 10:24:44AM -0700, Srini Sankaran wrote:
...
> A few minutes to hours later, node 2 might take over the read / write 
> responsibilities for "/big/a", and node 1 might move over to "/big/b", 
> etc.
...
> So, my question is: Can I simply use LOCK_NOLOCK in this situation and 
> avoid any lock server? Maybe the answer is no because the documentation 
> warns "Do not allow multiple nodes to mount the same file system while 
> LOCK_NOLOCK is used. Doing so causes one or more nodes to panic their 
> kernels, and may cause file system corruption".
> 
> I am still asking the question because of this partitioned file system 
> access characteristic of my application. Is this warning still valid if 
> I can guarantee that no two files or directories will be accessed by 
> two different nodes simultaneously? If I can't do LOCK_NOLOCK, is there 
> any other idea I can use here?

Nolock won't work here.  Even if the directory tree is partitioned between
nodes, the allocation bitmaps aren't.  Allocate enough and you'll see
contention there.  And without locking, you'll see corruption there too.

You also need locking to manage the transitions when a machine switches
directories.  Caches need to be flushed and invalidated.  The locking
makes that happen.

If you're reluctance to use locking is just because you don't want
dedicated GULM lock servers, you might want to try the DLM instead.

-- 
Ken Preslan <kpreslan at redhat.com>



From fabbione at fabbione.net  Sat Apr 16 08:33:44 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Sat, 16 Apr 2005 10:33:44 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel
	(2.6.12rc2)
Message-ID: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everybody,

the 26th of March 2005 Arnaldo Carvalho de Melo commited a quite big change to
sk_alloc:

ChangeSet 1.2181.42.2 2005/03/26 20:04:49 acme at toy.ghostprotocols.net
  [NET] make all protos partially use sk_prot
  
  sk_alloc_slab becomes proto_register, that receives a struct proto not necessarily
  completely filled, but at least with the proto name, owner and obj_size (aka proto
  specific sock size), with this we can remove the struct sock sk_owner and sk_slab,
  using sk->sk_prot->{owner,slab} instead.
  
  This patch also makes sk_set_owner not necessary anymore, as at sk_alloc time we
  have now access to the struct proto onwer and slab members, so we can bump the
  module refcount exactly at sock allocation time. 
  
  Another nice "side effect" is that this patch removes the generic sk_cachep slab
  cache, making the only last two protocols that used it use just kmalloc, informing
  a struct proto obj_size equal to sizeof(struct sock).
  
  Ah, almost forgot that with this patch it is very easy to use a slab cache, as it is
  now created at proto_register time, and all protocols need to use proto_register,
  so its just a matter of switching the second parameter of proto_register to '1', heck,
  this can be done even at module load time with some small additional patch. 
  
  Another optimization that will be possible in the future is to move the sk_protocol
  and sk_type struct sock members to struct proto, but this has to wait for all protocols
  to move completely to sk_prot.
  
  This changeset also introduces /proc/net/protocols, that lists the registered protocols
  details, some may seem excessive, but I'd like to keep them while working on further 
  struct sock hierarchy work and also to realize which protocols are old ones, i.e. that
  still use struct proto_ops, etc, yeah, this is a bit of an exaggeration, as all protos
  still use struct proto_ops, but in time the idea is to move all to use sk->sk_prot and
  make the proto_ops infrastructure be shared among all protos, reducing one level of
  indirection.
  
  Signed-off-by: Arnaldo Carvalho de Melo <acme at conectiva.com.br>
  Signed-off-by: David S. Miller <davem at davemloft.net>

The same change needs to be propagated to cman-kernel (probably more,
but i am working on one module at a time).
Here is a preliminary patch that works for me.

Please review before applying.

Signed-off-by: Fabio M. Di Nitto <fabbione at ubuntu.com>

Index: cnxman.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v
retrieving revision 1.55
diff -u -r1.55 cnxman.c
- --- cnxman.c	5 Apr 2005 13:43:09 -0000	1.55
+++ cnxman.c	16 Apr 2005 08:20:42 -0000
@@ -66,8 +66,8 @@
 extern void cman_set_realtime(struct task_struct *tsk, int prio);
 
 static struct proto_ops cl_proto_ops;
+static struct proto cl_proto;
 static struct sock *master_sock;
- -static kmem_cache_t *cluster_sk_cachep;
 
 /* Pointer to the pseudo node that maintains quorum in a 2node system */
 struct cluster_node *quorum_device = NULL;
@@ -918,14 +918,14 @@
 	return;
 }
 
- -static struct sock *cl_alloc_sock(struct socket *sock, int gfp)
+static struct sock *cl_alloc_sock(struct socket *sock, int gfp, int protocol)
 {
 	struct sock *sk;
 	struct cluster_sock *c;
 
 	if ((sk =
- -	     sk_alloc(AF_CLUSTER, gfp, sizeof (struct cluster_sock),
- -		      cluster_sk_cachep)) == NULL)
+	     sk_alloc(AF_CLUSTER, gpf, &cl_proto,
+		      1)) == NULL)
 		goto no_sock;
 
 	if (sock) {
@@ -937,6 +937,7 @@
 	sk->sk_no_check = 1;
 	sk->sk_family = PF_CLUSTER;
 	sk->sk_allocation = gfp;
+	sk->sk_protocol = protocol;
 
 	c = cluster_sk(sk);
 	c->port = 0;
@@ -1031,7 +1032,7 @@
 	if (!atomic_read(&cnxman_running) && protocol != CLPROTO_MASTER)
 		return -ENETDOWN;
 
- -	if ((sk = cl_alloc_sock(sock, GFP_KERNEL)) == NULL)
+	if ((sk = cl_alloc_sock(sock, GFP_KERNEL, protocol)) == NULL)
 		return -ENOBUFS;
 
 	sk->sk_protocol = protocol;
@@ -4155,6 +4156,12 @@
 	.owner       = THIS_MODULE,
 };
 
+static struct proto cl_proto = {
+	.name	     = "CMAN",
+	.owner	     = THIS_MODULE,
+	.obj_size    = sizeof(struct cluster_sock)
+};
+
 #ifdef MODULE
 MODULE_DESCRIPTION("Cluster Connection and Service Manager");
 MODULE_AUTHOR("Red Hat, Inc");
@@ -4166,19 +4173,14 @@
 	printk("CMAN %s (built %s %s) installed\n",
 	       CMAN_RELEASE_NAME, __DATE__, __TIME__);
 
- -	if (sock_register(&cl_family_ops)) {
- -		printk(KERN_INFO "Unable to register cluster socket type\n");
+	if (proto_register(&cl_proto,0) < 0) {
+		printk(KERN_INFO "Unable to register cluster protocol type\n");
 		return -1;
 	}
 
- -	/* allocate our sock slab cache */
- -	cluster_sk_cachep = kmem_cache_create("cluster_sock",
- -					      sizeof (struct cluster_sock), 0,
- -					      SLAB_HWCACHE_ALIGN, 0, 0);
- -	if (!cluster_sk_cachep) {
- -		printk(KERN_CRIT
- -		       "cluster_init: Cannot create cluster_sock SLAB cache\n");
- -		sock_unregister(AF_CLUSTER);
+	if (sock_register(&cl_family_ops)) {
+		proto_unregister(&cl_proto);
+		printk(KERN_INFO "Unable to register cluster socket type\n");
 		return -1;
 	}
 
@@ -4234,7 +4236,7 @@
 	cnxman_ioctl32_exit();
 #endif
 	sock_unregister(AF_CLUSTER);
- -	kmem_cache_destroy(cluster_sk_cachep);
+	proto_unregister(&cl_proto);
 }
 
 module_init(cluster_init);
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQmDNYFA6oBJjVJ+OAQJzyQ/+PjjPRmdqGKzpsms+96wTSzw5iaEsZHx4
9tZF6nbVBaCoygB9B0xkR0ra37DwZg+vWHOlzcS6HoHkiz0LveeXWb6Xu9bsTu2a
/9pIFSXFAaiwJTCE7FEHamHgm7yf2SyVyL2BS+05UzvYsfoG9JTIX2b8gsBtfb5J
qF5sZIqYrcrGn3wNLqxID+qgb1pKcgQfUGOWAVrdVy0xP2xClJQKSyFCsRcwCUmW
2qzIPW3DtBe996rlwVZAkupvHfueqGTkXNjhockah37+jO0KivcUA6ej2m+ZO1mk
Rc2Q5mEvjsq5UHHFXO27BomLXNYXdge9HZ9cAvip4tGvlby2PA90R0txTECKUbFK
jJCcfg9l0rS+OKGlCSEnyC52UIlU67lrvXiPvUFhyd0VMfVpaSFHe4NYJZbx0iQx
AFRcxaCkSLpZU78b4NpSig+qLz4ynLYcyPRXxL+WZpqRrbjaGnPdjkkwaX9hPqzs
cGLHMhgS8ImMZK6s67hutTIBXfgYZA7cdu9VzR+zITcssfuxowfCEMZOR/ixaD7+
jYSzS89NTHKhv0cAppu0JWNwC5vIKYu4WBxkRzTjjU8OqsozaSnvoDlQlyfn7Ffb
kqbXeJopnMHY1NW8DyazNRtrdArlP/Jw+7gi00S7LVDRlOpboxG9g5NDXhzTzmdP
goIHcBuTlWk=
=Dfi6
-----END PGP SIGNATURE-----



From fabbione at fabbione.net  Sat Apr 16 18:35:34 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Sat, 16 Apr 2005 20:35:34 +0200
Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel
	(2.6.12rc2)
In-Reply-To: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>
References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>
Message-ID: <42615AF6.20608@fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Fabio Massimo Di Nitto wrote:
> Hi everybody,
> 

> @@ -918,14 +918,14 @@
>  	return;
>  }
> 
> -static struct sock *cl_alloc_sock(struct socket *sock, int gfp)
> +static struct sock *cl_alloc_sock(struct socket *sock, int gfp, int protocol)
>  {
>  	struct sock *sk;
>  	struct cluster_sock *c;
> 
>  	if ((sk =
> -	     sk_alloc(AF_CLUSTER, gfp, sizeof (struct cluster_sock),
> -		      cluster_sk_cachep)) == NULL)
> +	     sk_alloc(AF_CLUSTER, gpf, &cl_proto,
> +		      1)) == NULL)
>  		goto no_sock;
> 
>  	if (sock) {

Meh.. sorry.. i just realized that i did a typo in this hunk s/gpf/gfp.

fabio

- --
Self-Service law:
The last available dish of the food you have decided to eat, will be
inevitably taken from the person in front of you.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCYVr1hCzbekR3nhgRAqMSAJ90YRnv4frDEDyqBSQeJ5xm+1h8/wCfUbny
Ed+7DiMfJ0SWD01pv1uCKKA=
=h65P
-----END PGP SIGNATURE-----



From cjkovacs at verizon.net  Sun Apr 17 11:01:35 2005
From: cjkovacs at verizon.net (Corey Kovacs)
Date: Sun, 17 Apr 2005 07:01:35 -0400
Subject: [Linux-cluster] GFS 6.0.24 hangs machine....
Message-ID: <200504170701.36039.cjkovacs@verizon.net>

Hello,

I've got a 5 node GFS cluster (RHEL3u4, GFS 6.0.2-24, kernel 2.4.21-24.0.1) 
with 3 volumes, one of which is approx 500GB and contains several thousand 
small files. When I do a find on that volume, or slocate is run via it's cron 
job, or I rsync that volume, the node used to access the volume get's into a 
state where it cannot fork anymore and nothing can be done with the machine 
until it is restarted (usually requiring a "fence_node" from another 
machine). The cluster is configured with 3 of the nodes acting as lock 
managers, using DL360's with 2GB ram each and qlogic 2342 dual port cards 
connected to an msa1000. The journals are not on there own volumes and the 
defaults are used for mounting. 


Is this a known problem? I've searched for other posts with this problem but
have not had any luck with it. Any ideas as to what might be causing this?


Thanks


Corey



From pcaulfie at redhat.com  Mon Apr 18 07:56:23 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 18 Apr 2005 08:56:23 +0100
Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel
	(2.6.12rc2)
In-Reply-To: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>
References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>
Message-ID: <20050418075623.GB6015@tykepenguin.com>

On Sat, Apr 16, 2005 at 10:33:44AM +0200, Fabio Massimo Di Nitto wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi everybody,
> 
> the 26th of March 2005 Arnaldo Carvalho de Melo commited a quite big change to
> sk_alloc:

Thanks.

This change is in my tree. I'll commit it with the other 2.6.12pre2 stuff
shortly.

-- 

patrick



From fabbione at fabbione.net  Mon Apr 18 08:00:47 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Mon, 18 Apr 2005 10:00:47 +0200
Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel
	(2.6.12rc2)
In-Reply-To: <20050418075623.GB6015@tykepenguin.com>
References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>
	<20050418075623.GB6015@tykepenguin.com>
Message-ID: <4263692F.5010906@fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Patrick Caulfield wrote:
> On Sat, Apr 16, 2005 at 10:33:44AM +0200, Fabio Massimo Di Nitto wrote:
> 
>>-----BEGIN PGP SIGNED MESSAGE-----
>>Hash: SHA1
>>
>>Hi everybody,
>>
>>the 26th of March 2005 Arnaldo Carvalho de Melo commited a quite big change to
>>sk_alloc:
> 
> 
> Thanks.
> 
> This change is in my tree. I'll commit it with the other 2.6.12pre2 stuff
> shortly.
> 

Cool, i can confirm that fix works fine on i386 and it builds fine
(sorry but i can't test) on ppc/amd64/sparc64/ia64/hppa.

Fabio

PS is anybody actually building cluster/ with gcc-4.0?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCY2kthCzbekR3nhgRAq2FAJ9sxnTu58c6/zC2VIylubnQvlpa5QCgpOGz
wF+dVnTclMMKsTXvqAzeBbc=
=aGHF
-----END PGP SIGNATURE-----



From ptr at poczta.fm  Mon Apr 18 13:11:10 2005
From: ptr at poczta.fm (ptr at poczta.fm)
Date: 18 Apr 2005 15:11:10 +0200
Subject: [Linux-cluster] Problems after upgrade :(
Message-ID: <20050418131110.780E43B21F0@poczta.interia.pl>

    Hello.

I installed the newest CVS version from the scratch
(because of noticed problems with old libs and binaries
remaining in system directories even after "make install" 
with certain components. 
Anyways, now I'm getting those entries as below after node startup
and even during normal work:

GFS: Trying to join cluster "lock_dlm", "cluster1:eva"
scheduling while atomic: cman_comms/0x00000001/8808
 [<c039a4d2>] schedule+0xbc2/0xbd0
 [<c01156fe>] __wake_up+0x3e/0x60
 [<c039ba0f>] _spin_unlock_irqrestore+0xf/0x30
 [<f8bb5759>] queue_message+0x109/0x120 [cman]
 [<f8bbe06d>] add_barrier_callback+0x7d/0x160 [cman]
 [<f8bbe1a0>] callback_startdone_barrier_new+0x20/0x30 [cman]
 [<f8bb6d27>] check_barrier_complete_phase2+0xc7/0x110 [cman]
 [<f8bb6e45>] process_barrier_msg+0xa5/0x120 [cman]
 [<f8bb306f>] process_incoming_packet+0x18f/0x290 [cman]
 [<f8bb21b1>] receive_message+0xd1/0xf0 [cman]
 [<f8bb235c>] cluster_kthread+0x18c/0x340 [cman]
 [<c0115630>] default_wake_function+0x0/0x20
 [<f8bb21d0>] cluster_kthread+0x0/0x340 [cman]
 [<c01009f5>] kernel_thread_helper+0x5/0x10
scheduling while atomic: cman_comms/0x00000001/8808
 [<c039a4d2>] schedule+0xbc2/0xbd0
 [<f8bb26de>] start_ack_timer+0x2e/0x40 [cman]
 [<f8bbe06d>] add_barrier_callback+0x7d/0x160 [cman]
 [<f8bbe1a0>] callback_startdone_barrier_new+0x20/0x30 [cman]
 [<f8bb6d27>] check_barrier_complete_phase2+0xc7/0x110 [cman]
 [<f8bb6e45>] process_barrier_msg+0xa5/0x120 [cman]
 [<f8bb306f>] process_incoming_packet+0x18f/0x290 [cman]
 [<f8bb21b1>] receive_message+0xd1/0xf0 [cman]
 [<f8bb235c>] cluster_kthread+0x18c/0x340 [cman]
 [<c0115630>] default_wake_function+0x0/0x20
 [<f8bb21d0>] cluster_kthread+0x0/0x340 [cman]
 [<c01009f5>] kernel_thread_helper+0x5/0x10

Besides, sometimes when I reboot one of the nodes
(it's 2-nodes cluster running 2.6.11.7), it won't start up
showing on console messages like "CMANsendmsg failed: "-101".
I have to reboot again to start the node fully up.
    Any hints on what's wrong?

TIA for your help, best regards

Piotr




------------------------------------------------------------------
Teraz na tapecie mamy najwiekszego z silaczy. 
Sciagnij >> http://link.interia.pl/f1873 <<



From rajkum2002 at rediffmail.com  Mon Apr 18 14:36:42 2005
From: rajkum2002 at rediffmail.com (Raj  Kumar)
Date: 18 Apr 2005 14:36:42 -0000
Subject: [Linux-cluster] Out of Memory Problem
Message-ID: <20050418143642.26172.qmail@webmail47.rediffmail.com>

Hi everyone,

One of our GFS Linux servers has crashed twice yesterday. The log messages indicate the server ran out of memory and started killing processes:

Out of Memory: Killed process 21188 (sshd).
Out of Memory: Killed process 5215 (xfs).

The server is a HP DL380 with dual Xeon 3.06 GHz processor, 1GB RAM, 2GB swap space running RHEL 3.0- kernel 2.4.21-27.0.1.ELsmp. The server runs NIS, GFS, SSHD and samba services. After the first crash the server didn?t start due to file system corruption. The problem has been corrected and server returned to operation yesterday evening. Today's log indicates the server ran out of memory and killed processes again this morning. This out of memory problem is recurring speciallyl when users are accessing the storage mounted using GFS.

Where can I start to debug the problem? 

free -m output:

             total       used       free     shared    buffers     cached
Mem:          1001        986         14          0          1         79
-/+ buffers/cache:        905         95
Swap:         1996         49       1946

I don't understand what's happening to the total 1GB memory. This is the free output that happened seconds before crash (I had swatch set up to log the statistics the moment it sees OOM messages). PS output doesn't show any process taking significant portion of memory either. Since this is happening only when users are using GFS heavily I suspect it is the problem. But how do I verify it? Is 1GB too small for a GFS server?

I found that another user has seen the same problem before:
https://www.redhat.com/archives/linux-cluster/2005-January/msg00099.html

GFS setup was fine and all our tests passed. We then moved it to production and it immediately failed after running for two days. Your help is very much appreciated!! The problem seems to be reproducible. So if you need any logs I can rerun what our users did at the time of crash.

Thanks,
Raj

================== Log ============================

Apr  7 10:49:15 server1 kernel: Mem-info:
Apr  7 10:49:15 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
Apr  7 10:49:15 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
Apr  7 10:49:15 server1 kernel: Zone:HighMem freepages:   287 min:   255 low:   510 high:   765
Apr  7 10:49:15 server1 kernel: Free pages:        3461 (   287 HighMem)
Apr  7 10:49:15 server1 kernel: ( Active: 22389/6071, inactive_laundry: 889, inactive_clean: 943, free: 3461 )
Apr  7 10:49:15 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
Apr  7 10:49:15 server1 kernel:   aa:6 ac:13 id:292 il:43 ic:0 fr:382
Apr  7 10:49:15 server1 kernel:   aa:15159 ac:7211 id:5769 il:856 ic:943 fr:287
Apr  7 10:49:15 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:15 server1 kernel: 0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:15 server1 kernel: 33*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1148kB) Apr  7 10:49:15 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:15 server1 kernel: 218499 pages of slabcache Apr  7 10:49:15 server1 kernel: 216 pages of kernel stacks Apr  7 10:49:16 server1 kernel: 0 lowmem pagetables, 489 highmem pagetables
Apr  7 10:49:16 server1 kernel: Free swap:       2038872kB
Apr  7 10:49:16 server1 kernel: 262138 pages of RAM Apr  7 10:49:16 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:16 server1 kernel: 5780 reserved pages Apr  7 10:49:16 server1 kernel: 16752 pages shared Apr  7 10:49:16 server1 kernel: 485 pages swap cached Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
Apr  7 10:49:20 server1 kernel: Mem-info:
Apr  7 10:49:20 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
Apr  7 10:49:20 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
Apr  7 10:49:20 server1 kernel: Zone:HighMem freepages:   291 min:   255 low:   510 high:   765
Apr  7 10:49:20 server1 kernel: Free pages:        3465 (   291 HighMem)
Apr  7 10:49:20 server1 kernel: ( Active: 21743/6636, inactive_laundry: 896, inactive_clean: 1049, free: 3465 )
Apr  7 10:49:20 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
Apr  7 10:49:20 server1 kernel:   aa:6 ac:36 id:265 il:40 ic:0 fr:382
Apr  7 10:49:20 server1 kernel:   aa:14479 ac:7222 id:6365 il:862 ic:1049 fr:291
Apr  7 10:49:20 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:20 server1 kernel: 28*4kB 1*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:20 server1 kernel: 37*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1164kB) Apr  7 10:49:20 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:21 server1 kernel: 218570 pages of slabcache Apr  7 10:49:21 server1 kernel: 196 pages of kernel stacks Apr  7 10:49:21 server1 kernel: 0 lowmem pagetables, 404 highmem pagetables
Apr  7 10:49:21 server1 kernel: Free swap:       2038872kB
Apr  7 10:49:21 server1 kernel: 262138 pages of RAM Apr  7 10:49:21 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:21 server1 kernel: 5780 reserved pages Apr  7 10:49:22 server1 kernel: 13904 pages shared Apr  7 10:49:22 server1 kernel: 485 pages swap cached Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
.........
............
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050418/4763474f/attachment.htm>

From pcaulfie at redhat.com  Mon Apr 18 14:48:12 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Mon, 18 Apr 2005 15:48:12 +0100
Subject: [Linux-cluster] Problems after upgrade :(
In-Reply-To: <20050418131110.780E43B21F0@poczta.interia.pl>
References: <20050418131110.780E43B21F0@poczta.interia.pl>
Message-ID: <20050418144812.GH6015@tykepenguin.com>

On Mon, Apr 18, 2005 at 03:11:10PM +0200, ptr at poczta.fm wrote:
>     Hello.
> 
> I installed the newest CVS version from the scratch
> (because of noticed problems with old libs and binaries
> remaining in system directories even after "make install" 
> with certain components. 
> Anyways, now I'm getting those entries as below after node startup
> and even during normal work:

Head of CVS is not a good thing to use. Checkout the RHEL4 branch instead.

> 
> Besides, sometimes when I reboot one of the nodes
> (it's 2-nodes cluster running 2.6.11.7), it won't start up
> showing on console messages like "CMANsendmsg failed: "-101".
> I have to reboot again to start the node fully up.
>     Any hints on what's wrong?
> 

well, -101 is "Network is unreachable" so check that the network is correctly
configure and full up before starting cman.

-- 

patrick



From mrc at linuxplatform.org  Mon Apr 18 15:00:02 2005
From: mrc at linuxplatform.org (Matt)
Date: Mon, 18 Apr 2005 11:00:02 -0400
Subject: [Linux-cluster] DB Clustering Question
In-Reply-To: <1113571189.6839.6.camel@althea.playway.net>
References: <1113571189.6839.6.camel@althea.playway.net>
Message-ID: <1113836403.6865.34.camel@althea.playway.net>

Thank you to everyone for the replies to my questions about clustering.
I'll let you know what option we end up going with.

-- 
Matt <mrc at linuxplatform.org>



From rajkum2002 at rediffmail.com  Mon Apr 18 17:04:38 2005
From: rajkum2002 at rediffmail.com (Raj  Kumar)
Date: 18 Apr 2005 17:04:38 -0000
Subject: [Linux-cluster] Out of Memory Problem
Message-ID: <20050418170438.7970.qmail@webmail47.rediffmail.com>

Hi everone,

cat /proc/slabinfo:
  
size-64           4825410 4825410    128 160847 160847    1 : 1008  252

This seems to be unusal... size-64 slab is consuming upto 643MB of RAM. This number seems to increase slowly... how to track which process is requesting the objects from this slab? Does anyone know if there is a bug related to this in RH 2.4.21-27.0.1.ELsmp kernel?

Thank you,
Raj  
 ?


On Mon, 18 Apr 2005 Raj  Kumar wrote :
>Hi everyone,
>
>One of our GFS Linux servers has crashed twice yesterday. The log messages indicate the server ran out of memory and started killing processes:
>
>Out of Memory: Killed process 21188 (sshd).
>Out of Memory: Killed process 5215 (xfs).
>
>The server is a HP DL380 with dual Xeon 3.06 GHz processor, 1GB RAM, 2GB swap space running RHEL 3.0- kernel 2.4.21-27.0.1.ELsmp. The server runs NIS, GFS, SSHD and samba services. After the first crash the server didn?t start due to file system corruption. The problem has been corrected and server returned to operation yesterday evening. Today's log indicates the server ran out of memory and killed processes again this morning. This out of memory problem is recurring speciallyl when users are accessing the storage mounted using GFS.
>
>Where can I start to debug the problem?
>
>free -m output:
>
>              total       used       free     shared    buffers     cached
>Mem:          1001        986         14          0          1         79
>-/+ buffers/cache:        905         95
>Swap:         1996         49       1946
>
>I don't understand what's happening to the total 1GB memory. This is the free output that happened seconds before crash (I had swatch set up to log the statistics the moment it sees OOM messages). PS output doesn't show any process taking significant portion of memory either. Since this is happening only when users are using GFS heavily I suspect it is the problem. But how do I verify it? Is 1GB too small for a GFS server?
>
>I found that another user has seen the same problem before:
>https://www.redhat.com/archives/linux-cluster/2005-January/msg00099.html
>
>GFS setup was fine and all our tests passed. We then moved it to production and it immediately failed after running for two days. Your help is very much appreciated!! The problem seems to be reproducible. So if you need any logs I can rerun what our users did at the time of crash.
>
>Thanks,
>Raj
>
>================== Log ============================
>
>Apr  7 10:49:15 server1 kernel: Mem-info:
>Apr  7 10:49:15 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
>Apr  7 10:49:15 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
>Apr  7 10:49:15 server1 kernel: Zone:HighMem freepages:   287 min:   255 low:   510 high:   765
>Apr  7 10:49:15 server1 kernel: Free pages:        3461 (   287 HighMem)
>Apr  7 10:49:15 server1 kernel: ( Active: 22389/6071, inactive_laundry: 889, inactive_clean: 943, free: 3461 )
>Apr  7 10:49:15 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
>Apr  7 10:49:15 server1 kernel:   aa:6 ac:13 id:292 il:43 ic:0 fr:382
>Apr  7 10:49:15 server1 kernel:   aa:15159 ac:7211 id:5769 il:856 ic:943 fr:287
>Apr  7 10:49:15 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:15 server1 kernel: 0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:15 server1 kernel: 33*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1148kB) Apr  7 10:49:15 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:15 server1 kernel: 218499 pages of slabcache Apr  7 10:49:15 server1 kernel: 216 pages of kernel stacks Apr  7 10:49:16 server1 kernel: 0 lowmem pagetables, 489 highmem pagetables
>Apr  7 10:49:16 server1 kernel: Free swap:       2038872kB
>Apr  7 10:49:16 server1 kernel: 262138 pages of RAM Apr  7 10:49:16 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:16 server1 kernel: 5780 reserved pages Apr  7 10:49:16 server1 kernel: 16752 pages shared Apr  7 10:49:16 server1 kernel: 485 pages swap cached Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
>Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
>Apr  7 10:49:20 server1 kernel: Mem-info:
>Apr  7 10:49:20 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
>Apr  7 10:49:20 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
>Apr  7 10:49:20 server1 kernel: Zone:HighMem freepages:   291 min:   255 low:   510 high:   765
>Apr  7 10:49:20 server1 kernel: Free pages:        3465 (   291 HighMem)
>Apr  7 10:49:20 server1 kernel: ( Active: 21743/6636, inactive_laundry: 896, inactive_clean: 1049, free: 3465 )
>Apr  7 10:49:20 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
>Apr  7 10:49:20 server1 kernel:   aa:6 ac:36 id:265 il:40 ic:0 fr:382
>Apr  7 10:49:20 server1 kernel:   aa:14479 ac:7222 id:6365 il:862 ic:1049 fr:291
>Apr  7 10:49:20 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:20 server1 kernel: 28*4kB 1*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:20 server1 kernel: 37*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1164kB) Apr  7 10:49:20 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:21 server1 kernel: 218570 pages of slabcache Apr  7 10:49:21 server1 kernel: 196 pages of kernel stacks Apr  7 10:49:21 server1 kernel: 0 lowmem pagetables, 404 highmem pagetables
>Apr  7 10:49:21 server1 kernel: Free swap:       2038872kB
>Apr  7 10:49:21 server1 kernel: 262138 pages of RAM Apr  7 10:49:21 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:21 server1 kernel: 5780 reserved pages Apr  7 10:49:22 server1 kernel: 13904 pages shared Apr  7 10:49:22 server1 kernel: 485 pages swap cached Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
>Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
>.........
>............
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050418/27468dab/attachment.htm>

From CAugustine at overlandstorage.com  Mon Apr 18 19:56:05 2005
From: CAugustine at overlandstorage.com (CAugustine at overlandstorage.com)
Date: Mon, 18 Apr 2005 12:56:05 -0700
Subject: [Linux-cluster] fence_manual...
Message-ID: <OFA068D17B.E0A95AD9-ON88256FE7.006CD0B9-88256FE7.006D81FB@myoverland.net>

Hi Everyone,

I have a two-node cluster. I have built and installed the cluster sources
from cluster_0406282100 snapshot. I can bring up both nodes successfully,
however, seems like I have a brain split cluster in that each node thinks 
it
is the only node. I ran the cman_tool on each node as follows:

        cman_tool join -c OVLCluster -2 -n nodename

Furthermore, in the log messages I often see messages that require one 
of the nodes to be rebooted. Some times I see the message in both nodes'
/var/log/messages files. In this case, I reboot the node that needs to be
rebooted and run the "fence_ack_maual -s rebooted-nodename" on the 
other system after the reboot. The problem is that then I see the same 
messages
again on the rebooted node's messages file. Seems like the cluster is in
some kind of a loop wanting to reboot that node over and over again after
the reboot. 

Can anyone tell me what is going on? Also, I am running the ccsd, 
cman_tool,
 fence_tool, clvmd, vgchange commands by hand. What version of the 
clustering
software has nice scripts such as "cluster start"?

Thanks,
Caroline

----------------------------------------------------------------------------------------------
See our award-winning line of tape and disk-based
backup & recovery solutions at http://www.overlandstorage.com
----------------------------------------------------------------------------------------------

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050418/ba9a8680/attachment.htm>

From bmarzins at redhat.com  Mon Apr 18 21:43:39 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Mon, 18 Apr 2005 16:43:39 -0500
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>
Message-ID: <20050418214338.GC8789@phlogiston.msp.redhat.com>

On Fri, Apr 15, 2005 at 04:24:21PM +0200, Hansjoerg.Maurer at dlr.de wrote:
> Hi
> 
> I found a solution for the problem descriped below,
> but I am not sure if it is the right way.
> 
> - importing the two gnbd's (wich point to the same device) from two servers
> -> /dev/gnbd0 and /dev/gnbd1 on the client
> 
> - creating a multipath device with something like this:
> echo "0 167772160  multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0
>  (251:0 ist the major:minor id of /dev/gnbd0)
> 
> - mounting the created device
> eg:
> mount -t gfs /dev/mapper/dm0 /mnt/lvol0
> 
> If I do a write on /mnt/lvol0 the gnbd_server task on both gnbd_servers start (with a noticeable speedup)
> 
> If one gnbd_server fails dm removes that path with the following log
> kernel: device-mapper: dm-multipath: Failing path 251:0.
> 
> I was able to add it again with
> 
> dmsetup  message dm0 0  reinstate_path 251:0
> 
> 
> I was able to deactivate a path manually with
> 
> dmsetup  message dm0 0  fail_path 251:0 
> 
> But I can not unimport the underlying gnbd
> 
> gnbd_import: ERROR cannot disconnect device #1 : Device or resource busy
> 
> 
> Is there a way to remove a gnbd, which is bunndled in a dm-multipath device?
> (might be necessary, if one gnbd server must be rebooted)
> 
> How can I reimport an gnbd on the client in state disconnected?
> (I had to manually start 
> gnbd_recvd -d 0  to do so)
> 
> Is the descriped solution for gnbd multipath the right one?

Um... It's a really ugly one. Unfortunately it's the only one that works, since
multipath-tools do not currently support non-scsi devices.

There are also some bugs in gnbd that make multipathing even more annoying.

But to answer your question, in order to remove gnbd, you must first get it
out of the multipath table, otherwise dm-multipath will still have it open.

To do this, after dmsetup status shows that the path is failed, you run:

# echo "0 167772160  multipath 0 0 1 1 round-robin 0 1 1 251:1 1000 " | dmsetup reload dm0
# dmsetup resume dm0

This removes the gnbd from the path.

However, if you use the gnbd code from the cvs head, it is no longer necessary
to do this to reimport the device.  In the stable branch, gnbd_monitor waits
until all users close the device before setting it to restartable. In the head
code, this happens as soon as the device is successfully fenced. So, if you
loose a gnbd server, reboot it, and reexport the device, gnbd_monitor should
automatically reimport the device, and you can simply run 

# dmsetup  message dm0 0  reinstate_path 251:0

and you should never need to remove the gnbd device with the method I described
above.


-Ben

> Thank you very much
> 
> Greetings from munich
> 
> Hansj?rg
> 
> 
> 
> 
> 
> 
> >Hi
> >
> >I am trying to set up gnbd with multipath.
> >Accoding to the gnbd_usage.txt file, I understand, that this should work with
> >dm-multipath.
> >But unfortunatly only the gfs part of the setup is descriped there.
> >
> >Has anybody experiance with this setup, especially how to set up
> >multipath with multiple /dev/gnbd* and how to setup the multipath.conf file
> >
> >
> >Thank you very much
> >
> >Hansj?rg Maurer
> -- 
> _________________________________________________________________
> 
> Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>                                 |
> Deutsches Zentrum               | DLR Oberpfaffenhofen
>   f. Luft- und Raumfahrt e.V.   |
> Institut f. Robotik             |
> Postfach 1116                   | Muenchner Strasse 20
> 82230 Wessling                  | 82234 Wessling
> Germany                         |
>                                 |
> Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
> Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
> __________________________________________________________________
> 
> 
> There are 10 types of people in this world, 
> those who understand binary and those who don't.
> 
> 
> 
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From bmarzins at redhat.com  Mon Apr 18 21:59:24 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Mon, 18 Apr 2005 16:59:24 -0500
Subject: [Linux-cluster] [PATCH] Fix gnbd-kernel build with 2.6.12rc2
In-Reply-To: <20050414144519.546922A8C@trider-g7.fabbione.net>
References: <20050414144519.546922A8C@trider-g7.fabbione.net>
Message-ID: <20050418215924.GD8789@phlogiston.msp.redhat.com>

On Thu, Apr 14, 2005 at 04:45:19PM +0200, Fabio Massimo Di Nitto wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi everybody,
> 
> the following patch fixes compilation of gnbd.c with 2.6.12rc2.
> 
> The i_sock has been recently removed from the inode structure (change
> happened in the kernel tree the 1st of April) and made part of i_mode.
> 
> Please apply.

Thanks,

I'll get it in shortly.

-Ben
 
> Signed-off-by: Fabio Massimo Di Nitto <fabbione at ubuntu.com>
> 
> Index: gnbd-kernel/src/gnbd.c
> ===================================================================
> RCS file: /cvs/cluster/cluster/gnbd-kernel/src/gnbd.c,v
> retrieving revision 1.7
> diff -u -r1.7 gnbd.c
> - --- gnbd-kernel/src/gnbd.c	7 Apr 2005 16:19:37 -0000	1.7
> +++ gnbd-kernel/src/gnbd.c	14 Apr 2005 14:30:29 -0000
> @@ -735,7 +735,7 @@
>  		if (!file)
>  			return error;
>  		inode = file->f_dentry->d_inode;
> - -		if (!inode->i_sock) {
> +		if (!S_ISSOCK(inode->i_mode)) {
>  			fput(file);
>  			return error;
>  		}
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.5 (GNU/Linux)
> 
> iQIVAwUBQl6BTVA6oBJjVJ+OAQIU7BAAnea52QS9ISXWHXWrrqeEaqFVbm1bSs1A
> +BKMycDiSsDKwttb+/bma2V56gjdqnv7//11wv2IiG5lt1q1HebgVTM+ecPMCRBb
> 6VsJV2NB+HgjRtcNkbAiw7hLVpcG+WFe5VaFVSsG20B5I47n9ahkF0a8umY4zSbd
> O1pCBJA3H4QMiTwlNA8kEj5EBdc3/jB4KCYGwGNhR7m61etZ4JMiEdGlOeQwYMK1
> 4DcXpCgo8aBLACUHGST2e3mnq48ztHHMNI7M0H8BLNrUbhm1EtIEtzyXqJjrS7ku
> TNZKKyfjlioAJk4B718ValMMEifZtlxwjlT3FEYfEd7/MUA2sw6ET4arFbDKcGjU
> Bn5wdFdoVDZpDwhWICfQq2rVleBydNGCyZ4HYMcI3WBi3RKH21zrLnt5YqL9EA/9
> 9TC8PhD24i8+9rp/kmRV3QtWJtooEO2VSfGKJSDXHoeKkt8S2RTByxuBo5UpBMkI
> z/+lB8zlDyF+qvn3TtkaTuJC8fk3clrkQfT+jiI4/7ZztK37NgcCF9Qe1rac3QS4
> VFRTrYJD8hcAOMa40HHCdZTyezetE4N/m6SDOJ+Pps+2KTWYxkJguas0+Aua5yeP
> jyyAV3vmKMmPewbNknw1gHoPTI4pz1QUZ89E3hhnmM1Zoi6y4CMzq1ndv/ZqAROx
> cS4j9lsnd60=
> =+YaG
> -----END PGP SIGNATURE-----
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From teigland at redhat.com  Tue Apr 19 02:35:46 2005
From: teigland at redhat.com (David Teigland)
Date: Tue, 19 Apr 2005 10:35:46 +0800
Subject: [Linux-cluster] fence_manual...
In-Reply-To: <OFA068D17B.E0A95AD9-ON88256FE7.006CD0B9-88256FE7.006D81FB@myoverland.net>
References: <OFA068D17B.E0A95AD9-ON88256FE7.006CD0B9-88256FE7.006D81FB@myoverland.net>
Message-ID: <20050419023546.GA6559@redhat.com>

On Mon, Apr 18, 2005 at 12:56:05PM -0700, CAugustine at overlandstorage.com wrote:

> I have a two-node cluster. I have built and installed the cluster
> sources from cluster_0406282100 snapshot. 

Check out code from the RHEL4 branch of cvs and I think you'll have much
better luck.

-- 
Dave Teigland  <teigland at redhat.com>



From Hansjoerg.Maurer at dlr.de  Tue Apr 19 06:37:54 2005
From: Hansjoerg.Maurer at dlr.de (Hansjoerg Maurer)
Date: Tue, 19 Apr 2005 08:37:54 +0200
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <20050418214338.GC8789@phlogiston.msp.redhat.com>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>
	<20050418214338.GC8789@phlogiston.msp.redhat.com>
Message-ID: <4264A742.5020508@dlr.de>

Hi

thank you for your reply,

I have been doing some testing during the weekend and found a better 
solution

In my e-mail from last week I had the following setup

- SAN (sda1+sdb1)
- 2 Nodes directly attached which form a LVM Stripe set aut of sda1 and 
sdb1 and export it (the created lvm)  via gnbd each
- Nodes in the LAN which import the two gnbd's and form a multipath-dm 
target with round-robin policy


It works, but I found a solution wich looks much better.
- SAN (sda1+sdb1)
- 2 Nodes directly attached which  export  sda1+sdb1  via gnbd each 
(sda1 and sdb1 form a striped lvm)
- Nodes in the LAN which gnbd-import  sda1+sdb1 from each node
-> noda_sda1 as gnbd0
-> noda_sdb1 as gnbd1
-> nodb_sda1 as gnbd2
-> nodb_sdb1 as gnbd3
- now I created a failover multipath configuration
echo "0 85385412 multipath 0 0 2 1 round-robin 0 1 1 251:0 1000 
round-robin 0 1 1 251:2 1000" | dmsetup create dma

echo "0 85385412 multipath 0 0 2 1 round-robin 0 1 1 251:3 1000 
round-robin 0 1 1 251:1 1000" | dmsetup create dmb


In this configuration traffic to sda1 goes primaly to noda and traffic 
to sdb1 primaly to nodeb.
I adapt lvm.conf not to include /dev/gnbd in the search for volumgroups, 
instead /dev/mapper/dm (I get rid of the duplicate volumgroup with this 
workaround).
After I start clvmd, I can see the Volume on the client.

With this solution, I have a speedup of about 50% compared to example one
(I think because the stipping is done by the client, whereas in example 
one the client performs round-robin load-balancing
about differnt pathes and the gnbd server stripes on both disks...)

With
dmsetup  message dma 0  disable_group 1
dmsetup  message dmb 0  disable_group 2
dmsetup  message dma 0  enable_group 1
dmsetup  message dmb 0  enable_group 2
I can switch between the two pathes.

It will be a bit of work is to get the startup scripts work correctly, 
because the dmsetup multipath command depends on the major and minor
device ID's of the gnbd-devices of the client, which seem not to bee 
persistent,
Will take some time of scripting, in order to abstract it.... :-)
I will post it, if I have a solution...

The most anoying point is for me at the moment the differnence between 
gnbd read and write performance.
Therefore I am glad, that you as a gnbd-developer answered...
In my tests, gnbd write is about two to three times faster the gnbd reads.
I tried a lot of things (exporting cached, changing readahead with 
blockdev command (on the underlying device), changing TCP-IP buffersizes)
but I had nor improvement.

In the upper example, I get a write speed of about 85MB/s over gnbd and 
a read speed of about 26 MB/s .
(the underlying device's sda and sdb manages about 50MB/s (read and write).
Therefore read speed is very good....

First I thougt, it might be related to the strange dm-setup I was 
running, and therefore I
tried it with gnbd-exporting and importing just a single block device 
(without lvm and dm)
but the problem remains...

Do I have misconfiguerd something completly (I am using GBEth bonding 
devices) or can you or anybody else confirm the
behavior of much better write than read performance?
I was testing with RHEL4 2.6.9-6.38.EL

Thank you for your help and your great work...

Greetings from a rainy morning  in Munich


Hansj?rg
 












Benjamin Marzinski wrote:

>On Fri, Apr 15, 2005 at 04:24:21PM +0200, Hansjoerg.Maurer at dlr.de wrote:
>  
>
>>Hi
>>
>>I found a solution for the problem descriped below,
>>but I am not sure if it is the right way.
>>
>>- importing the two gnbd's (wich point to the same device) from two servers
>>-> /dev/gnbd0 and /dev/gnbd1 on the client
>>
>>- creating a multipath device with something like this:
>>echo "0 167772160  multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 1000 " | dmsetup create dm0
>> (251:0 ist the major:minor id of /dev/gnbd0)
>>
>>- mounting the created device
>>eg:
>>mount -t gfs /dev/mapper/dm0 /mnt/lvol0
>>
>>If I do a write on /mnt/lvol0 the gnbd_server task on both gnbd_servers start (with a noticeable speedup)
>>
>>If one gnbd_server fails dm removes that path with the following log
>>kernel: device-mapper: dm-multipath: Failing path 251:0.
>>
>>I was able to add it again with
>>
>>dmsetup  message dm0 0  reinstate_path 251:0
>>
>>
>>I was able to deactivate a path manually with
>>
>>dmsetup  message dm0 0  fail_path 251:0 
>>
>>But I can not unimport the underlying gnbd
>>
>>gnbd_import: ERROR cannot disconnect device #1 : Device or resource busy
>>
>>
>>Is there a way to remove a gnbd, which is bunndled in a dm-multipath device?
>>(might be necessary, if one gnbd server must be rebooted)
>>
>>How can I reimport an gnbd on the client in state disconnected?
>>(I had to manually start 
>>gnbd_recvd -d 0  to do so)
>>
>>Is the descriped solution for gnbd multipath the right one?
>>    
>>
>
>Um... It's a really ugly one. Unfortunately it's the only one that works, since
>multipath-tools do not currently support non-scsi devices.
>
>There are also some bugs in gnbd that make multipathing even more annoying.
>
>But to answer your question, in order to remove gnbd, you must first get it
>out of the multipath table, otherwise dm-multipath will still have it open.
>
>To do this, after dmsetup status shows that the path is failed, you run:
>
># echo "0 167772160  multipath 0 0 1 1 round-robin 0 1 1 251:1 1000 " | dmsetup reload dm0
># dmsetup resume dm0
>
>This removes the gnbd from the path.
>
>However, if you use the gnbd code from the cvs head, it is no longer necessary
>to do this to reimport the device.  In the stable branch, gnbd_monitor waits
>until all users close the device before setting it to restartable. In the head
>code, this happens as soon as the device is successfully fenced. So, if you
>loose a gnbd server, reboot it, and reexport the device, gnbd_monitor should
>automatically reimport the device, and you can simply run 
>
># dmsetup  message dm0 0  reinstate_path 251:0
>
>and you should never need to remove the gnbd device with the method I described
>above.
>
>
>-Ben
>
>  
>
>>Thank you very much
>>
>>Greetings from munich
>>
>>Hansj?rg
>>
>>
>>
>>
>>
>>
>>    
>>
>>>Hi
>>>
>>>I am trying to set up gnbd with multipath.
>>>Accoding to the gnbd_usage.txt file, I understand, that this should work with
>>>dm-multipath.
>>>But unfortunatly only the gfs part of the setup is descriped there.
>>>
>>>Has anybody experiance with this setup, especially how to set up
>>>multipath with multiple /dev/gnbd* and how to setup the multipath.conf file
>>>
>>>
>>>Thank you very much
>>>
>>>Hansj?rg Maurer
>>>      
>>>
>>-- 
>>_________________________________________________________________
>>
>>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>>                                |
>>Deutsches Zentrum               | DLR Oberpfaffenhofen
>>  f. Luft- und Raumfahrt e.V.   |
>>Institut f. Robotik             |
>>Postfach 1116                   | Muenchner Strasse 20
>>82230 Wessling                  | 82234 Wessling
>>Germany                         |
>>                                |
>>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
>>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
>>__________________________________________________________________
>>
>>
>>There are 10 types of people in this world, 
>>those who understand binary and those who don't.
>>
>>
>>
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>    
>>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>  
>


-- 
_________________________________________________________________

Dr.  Hansjoerg Maurer           | LAN- & System-Manager
                                |
Deutsches Zentrum               | DLR Oberpfaffenhofen
  f. Luft- und Raumfahrt e.V.   |
Institut f. Robotik             |
Postfach 1116                   | Muenchner Strasse 20
82230 Wessling                  | 82234 Wessling
Germany                         |
                                |
Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
__________________________________________________________________


There are 10 types of people in this world, 
those who understand binary and those who don't.



From fabbione at fabbione.net  Tue Apr 19 07:37:18 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Tue, 19 Apr 2005 09:37:18 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] (cosmetic) do not configure cmirror
Message-ID: <20050419073718.6CB662D55@trider-g7.fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hi everybody,
   just a minor change to configure. cmirror has been commented
out from all the targets in the toplevel Makefile,
but it is still configured.

Skip cmirror configuration, since it is not built anymore.

Patch against CVS HEAD 2005-04-19 07:13 UTC.

Signed-off-by: Fabio M. Di Nitto <fabbione at ubuntu.com>

Thanks
Fabio

- --- configure	17 Nov 2004 04:29:09 -0000	1.4
+++ configure	19 Apr 2005 07:29:42 -0000
@@ -45,5 +45,5 @@
 echo "configure rgmanager"
 (cd rgmanager; ./configure $@)
 
- -echo "configure cmirror"
- -(cd cmirror; ./configure $@)
+#echo "configure cmirror"
+#(cd cmirror; ./configure $@)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQmS0s1A6oBJjVJ+OAQLAQxAAuYRnxOrDGcQQphhNbrcmu123bnodeGhI
9dKLaq0rgXjIQ2klhkKd491/waLQbQmoyVHJFYJeCgGQ9HeiVQB11giSZ6eDISWb
+AtmBarkYZCIq8DqyHJ/eNSISVr7H6YF91PGEZNgV5oJ8uFmzNz8fhJ/fRyGbJrX
BXDqF6+HpwGr0vVlThV8dEXFsStT8Nh9HPYD4IyBsOrwJ0Tl2H9a5GY4EjKyGqTJ
WGIIQz2wjizAKy5J8S2uKIrXiDpHN0MprLUa7lqWswIo22/OE03tnF1VqC8y8/4T
3F+IE66/YHBJ+m4G5qWc3qCZGyGJnWKtH24dFENg/TxNrqjB2o0Srbi9tOCy/FYb
dEfd3eAVdG8Jpyg02ayRi3aaHQW2/7JO6ELEAVKxUapxNUnfq7c4JoTxIza1Q/gp
SDMUf93EBWe123/xJHPBMOzVDPu/dQF1GP5P8FbOR/xfS1jk1YvM1/cmyubzLvyd
t1XPQtjSAM+eqxkO+rnjs6vngi0RlezuW08ET3WNWX5JMZgzjxwyRGZ/Q28gK+7a
98cOCGwxYkOjtZtQJeyhS4GNrCpHOeT0ok4KVbcY8w2DUPwz+m7+1U23n/IYZNSC
7SDbfIdbnDoxzn05gvKcx45c4V/yOdHa5wY+EIg+mHAVoF6AunFH8iss3u8N/JrE
K7mIbYdvYCk=
=NMFk
-----END PGP SIGNATURE-----



From fabbione at fabbione.net  Tue Apr 19 07:41:01 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Tue, 19 Apr 2005 09:41:01 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] (cosmetic) propagate distclean to gfs_fsck
Message-ID: <20050419074101.0BE3E2D55@trider-g7.fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi again,
   just another cosmetic change.
gfs needs to propagate distclean at least to gfs_fsck otherwise
there is a bunch of clutter left around :)

Patch against CVS HEAD 2005-04-19 07:13 UTC.

Signed-off-by: Fabio M. Di Nitto <fabbione at ubuntu.com>

Thanks
Fabio

- --- gfs/Makefile	31 Mar 2005 05:15:50 -0000	1.4
+++ gfs/Makefile	19 Apr 2005 07:37:44 -0000
@@ -43,6 +43,7 @@
 	cd gfs_tool && ${MAKE} clean
 
 distclean: clean
+	cd gfs_fsck && ${MAKE} distclean
 	rm -f make/defines.mk
 
 install:
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQmS2C1A6oBJjVJ+OAQIeXQ//ZyQOoSV2MN4H/RpgphppqfTKW/nNAFht
0/54Pi/E8JdAppUrxA/sWg93cyr1Qjzsv5eAYtNT/3tuwMUSm1aoJt3tkiCCTP69
yWLf/cu8/FIQKSHHZ6vhGzuTmGb3qYpCZxdLndsctuSZehR64UGQQYYjCc2pwgYV
5c+w07N8Zrl8MTUCYbcTQLDC5IaIwHaPHJDPIClLRte2u4u2rxo0ij1UtQsdgqrG
6VakdEdWMjCdNgzcZpcsmoAmICygQ4sZZkdVYdjKI+8gGts20OS+n4aRMJZ56k2C
wqI1gRT0ujK7+mtAt866zcd1lsE0V/rQP8CrXSEl2y0sqfY8RLCSdLpEBMji3f+5
wHmeJxB7rW/NM936BtFBpd15ymhwdnfDIVJCVl/r29GPM390FwGzLL0r05p4wsIo
yCUGojKPjq7KJy7Aq1z+CzIxHu2OFwJgI0Qceay4XuvgcF3WYgwKaS8sQ/e7AonJ
6UbtMqlo9JdHHyiMlIqdKnfI+arvU3ftrF59AybG1T+J3/jh2Dlh8bJ8xzXeQAEO
O1eRgbRHoaoWw57NNfpFBA0N1bsjBU60nG3Cj8yhDxYxlPCWTw0NXT+9gZdX5H/3
eiTfrFs9+rX96CSNdzG1cIagPY8FeJqNxvL+Pf7W2LshH9A7iKnnzjwbTRDEzA8A
jIFgml+0eNY=
=Niv8
-----END PGP SIGNATURE-----



From birger at birger.sh  Tue Apr 19 11:32:41 2005
From: birger at birger.sh (birger)
Date: Tue, 19 Apr 2005 13:32:41 +0200
Subject: [Linux-cluster] clusterfs.sh: misleading description of parameter
	'options'
Message-ID: <4264EC59.8040009@birger.sh>


The description of the perameter 'options' in clusterfs.sh talks about 
doing file system check... It seems to be useable for setting any kind 
of mount option, so the description is misleading. Same for fs.sh. 
netfs.sh is ok.

-- 
birger



From birger at birger.sh  Tue Apr 19 13:08:18 2005
From: birger at birger.sh (birger)
Date: Tue, 19 Apr 2005 15:08:18 +0200
Subject: [Linux-cluster] How to set up NFS HA service
Message-ID: <426502C2.7030803@birger.sh>

Debugging a cluster setup with this software could have been easier 
given better error messages from the components, but I'm getting there...

I thought I'd just mount my gfs file systems outside the resource 
manager's control to have them present all the time and just use the 
resource manager to move over the IP address and do the NFS magic. That 
seems impossible, as I couldn't get any exports to happen when I defined 
them in cluster.conf without a surrounding <fs>. I could define the 
exports in /etc/exports, but then I would have to synch files. So in the 
end I put all my gfs file systems into cluster.conf.

It almost works. I get mounts, and they get exported. But I have some 
error messages in the log file and the exports take a loooong time. Only 
2 of the 3 exports defined seem to show up.

I'm also a bit puzzled about why the file systems don't get unmounted 
when I disable all services.

As for file locking:
I copied /etc/init.d/nfslock to /etc/init.d/nfslock-svc and made some 
changes.
First, I added a little code to enable nfslock to read a variable 
STATD_STATEDIR for the -p option from the config file in /etc/sysconfig. 
I think this should get propagated back to upcoming fedora releases if 
someone who knows how would bother to do it... I then changed 
nfslock-svc to read a different config file (/etc/sysconfig/nfs-svc) and 
to do 'service nfslock stop' at the top of the start section and 
'service nfslock start' at the bottom of the stop section.
This enables me to have statd running as e.g. 'server1' on the cluster 
node until it takes over the nfs service. At takeover, statd gets 
restarted with statedir on a cluster file system (so it can take over 
lock info belonging to the service) and with the name of the NFS service 
IP address. Does this sound reasonable? I know I'll loose any locks the 
cluster node may have had (as NFS client) when it takes over the nfs 
service, but I cannot see any reason why the cluster node should have 
nfs locks (or nfs mounts for that matter) except when doing admin work. 
I think I could fix it by copying /var/lib/nfs/statd/sm* into the 
clustered file system right after the 'service nfslock stop' I put in.

I have appended part of my messages file and my cluster.conf file. Any 
help with my NFS export issues will be appreciated.

-- 
birger

-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster.conf
Type: text/xml
Size: 2950 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050419/bd2d3168/attachment.xml>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050419/bd2d3168/attachment.ksh>

From srua at plus.net  Tue Apr 19 13:30:03 2005
From: srua at plus.net (Sergio Rua)
Date: Tue, 19 Apr 2005 14:30:03 +0100
Subject: [Linux-cluster] getting started
Message-ID: <426507DB.2080702@plus.net>

Hi,

I'm getting started with GFS but I cannot find documentation to get a
cluster running. Is there anything I could read? Thanks.

-- 
Sergio Rua 



From Birger.Wathne at ift.uib.no  Mon Apr 18 07:19:29 2005
From: Birger.Wathne at ift.uib.no (Birger Wathne)
Date: Mon, 18 Apr 2005 09:19:29 +0200
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
In-Reply-To: <1113577167.20618.155.camel@ayanami.boston.redhat.com>
References: <425E2180.6060609@birger.sh>
	<1113577167.20618.155.camel@ayanami.boston.redhat.com>
Message-ID: <42635F81.60801@uib.no>

Lon Hohberger wrote:
> On Thu, 2005-04-14 at 09:53 +0200, birger wrote:
.
.
.
> Fencing is required in order for CMAN to operate in any useful capacity
> in 2-node mode.
I currently use manual fencing, as the other node in the cluster doesn't 
exist yet... :-)
> 
> Anyway, to make this short: You probably want fencing for your solution.
.
.
.
> With GFS, the file locking should just kind of "work", but the client
> would be required to fail over.  I don't think the Linux NFS client can
> do this, but I believe the Solaris one can... (correct me if I'm wrong
> here).
When I worked with sun clients they could select between alternative servers 
at mount, but not fail over from one server to another if the server became 
unavailable.

> With a pure NFS failover solution (ex: on ext3, w/o replicated cluster
> locks), there needs to be some changes to nfsd, lockd, and rpc.statd in
> order to make lock failover work seamlessly.
I once did this on a sun system by stopping statd and merging the contents 
of /etc/sm* from the failing node to the takeover node and then restarting. 
This seemed to have statd/lockd rechecking the locks with the clients. I 
hoped something similar could be done on linux.

> You can use rgmanager to do the IP and Samba failover.  Take a look at
> "rgmanager/src/daemons/tests/*.conf".  I don't know how well Samba
> failover has been tested.
This was a big help! The only documentation I found when searching for 
rgmanager on the net used <resourcegroup> instead of <service>. No wonder I 
couldn't get my services up!

I now have my NFS service starting with this block in cluster.conf:
<rm>
   <failoverdomains>
     <failoverdomain name="nfsdomain" ordered="0" restricted="1">
       <failoverdomainnode name="server1" priority="1"/>
       <failoverdomainnode name="server2" priority="2"/>
     </failoverdomain>
     <failoverdomain name="smbdomain" ordered="0" restricted="1">
       <failoverdomainnode name="server1" priority="2"/>
       <failoverdomainnode name="server2" priority="1"/>
     </failoverdomain>
   </failoverdomains>

   <resources>
   </resources>

   <service name="nfssvc" domain="nfsdomain">
     <ip address="X.X.X.X" monitor_link="yes"/>
     <script name="NFS script" file="/etc/init.d/nfs"/>
     <nfsexport name="NFS exports" mountpoint="/service/gfs001">
        <nfsclient name="nis-hosts" target="@nis-hosts" options="rw"/>
     </nfsexport>
   </service>

   <service name="smbsvc" domain="smbdomain">
   </service>
</rm>

It starts nfssvc, but smbsvc fails. No worry, since it is useless at the 
moment. server1 is the only existing server in the cluster.

The big surprise for me was that ifconfig and exportfs don't show the IP 
address and exports set up by the cluster, but at least the ip certainly 
works. My problem now is that I get permission denied when mounting on the 
clients, and the logfile on the server says the clients are unknown. Seems 
like it isn't resolving them, as they are listed only with ip address in the 
log. Or could it be that I cannot use <nfsexport> without a <fs>?

How would I normally go about having a gfs file system mounted at boot. 
Create a service bound to server1 that mounts it? Or can it be put in 
/etc/fstab? smbsvc is supposed to operate on the same file system, so I want 
the file system to always be there independent of the nfssvc and smbsvc.

Thanks for all help so far. I'm getting close... :-)

-- 
birger



From lhh at redhat.com  Tue Apr 19 15:58:34 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 19 Apr 2005 11:58:34 -0400
Subject: [Linux-cluster] How to set up NFS HA service
In-Reply-To: <426502C2.7030803@birger.sh>
References: <426502C2.7030803@birger.sh>
Message-ID: <1113926314.20618.258.camel@ayanami.boston.redhat.com>

On Tue, 2005-04-19 at 15:08 +0200, birger wrote:

> I thought I'd just mount my gfs file systems outside the resource 
> manager's control to have them present all the time and just use the 
> resource manager to move over the IP address and do the NFS magic. That 
> seems impossible, as I couldn't get any exports to happen when I defined 
> them in cluster.conf without a surrounding <fs>.

Known bug/feature:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151669

You can change this behavior if you wanted to by adding <child type=...>
to service.sh's "special" element in the XML meta-data.

> It almost works. I get mounts, and they get exported. But I have some 
> error messages in the log file and the exports take a loooong time. Only 
> 2 of the 3 exports defined seem to show up.
> 
> I'm also a bit puzzled about why the file systems don't get unmounted 
> when I disable all services.

They're GFS.  Add force_unmount="1" to the <fs> elements if you want
them to be umounted.  GFS is nice because you *don't* have to umount
it.  

FYI, NFS services on traditional file systems don't cleanly stop right
now due to an EBUSY during umount from the kernel.  Someone's looking in
to it on the NFS side (apparently, not all the refs are getting cleared
if a node has an NFS mount ref and we unexport the FS, or something).


> As for file locking:
> I copied /etc/init.d/nfslock to /etc/init.d/nfslock-svc and made some 
> changes.
> First, I added a little code to enable nfslock to read a variable 
> STATD_STATEDIR for the -p option from the config file in /etc/sysconfig. 
> I think this should get propagated back to upcoming fedora releases if 
> someone who knows how would bother to do it... I then changed 
> nfslock-svc to read a different config file (/etc/sysconfig/nfs-svc) and 
> to do 'service nfslock stop' at the top of the start section and 
> 'service nfslock start' at the bottom of the stop section.
> This enables me to have statd running as e.g. 'server1' on the cluster 
> node until it takes over the nfs service. At takeover, statd gets 
> restarted with statedir on a cluster file system (so it can take over 
> lock info belonging to the service) and with the name of the NFS service 
> IP address. Does this sound reasonable? 

Sort of.  There's a lot needed to make NFS locks fail over properly, and
some of it has to be done in the kernel.  My memories of working on this
had the following things (this might be incomplete):

rpc.statd:

- Make rpc.statd monitor who had locks (SM_MONITOR requests from lockd)
based on the inbound IP address on which the lock request was received.
This isn't easy, because the only client of rpc.statd is lockd on the
local node.  Also, you'll need to have this list either replicated or on
a shared partition which can be moved about with the rest of the
service.

- Make rpc.statd be able to to notify clients who took locks based on a
specific IP address (I think my patch or one just like it eventually
made it into nfs-utils, so this might be done).

lockd:

- Add a method to tell lockd to set a grace period for lock recovery.
This should only be done for a specific IP or device only, not on
everything -- otherwise, every time an NFS service was started/failed-
over, every single NFS client who wanted a lock would have to wait for
the grace period - which sucks.

- Provide a way for lockd to tell rpc.statd what IP address the lock
request came in on.


> Apr 19 14:42:58 server1 clurgmgrd[7498]: <notice> Service nfssvc started
> Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
> Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)
> Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
> Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)

Hmm, that's odd, it could be a bug in the status phase which is related
to NIS exports.  Does this only happen after a failover, or does it
happen all the time?

-- Lon



From bmarzins at redhat.com  Tue Apr 19 16:19:31 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Tue, 19 Apr 2005 11:19:31 -0500
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <4264A742.5020508@dlr.de>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>
	<20050418214338.GC8789@phlogiston.msp.redhat.com>
	<4264A742.5020508@dlr.de>
Message-ID: <20050419161931.GE8789@phlogiston.msp.redhat.com>

On Tue, Apr 19, 2005 at 08:37:54AM +0200, Hansjoerg Maurer wrote:
> 
> The most anoying point is for me at the moment the differnence between 
> gnbd read and write performance.
> Therefore I am glad, that you as a gnbd-developer answered...
> In my tests, gnbd write is about two to three times faster the gnbd reads.
> I tried a lot of things (exporting cached, changing readahead with 
> blockdev command (on the underlying device), changing TCP-IP buffersizes)
> but I had nor improvement.
> 
> In the upper example, I get a write speed of about 85MB/s over gnbd and 
> a read speed of about 26 MB/s .
> (the underlying device's sda and sdb manages about 50MB/s (read and write).
> Therefore read speed is very good....

Um. On my machines I get 41.6707 MB/sec for reads and 41.6805 MB/sec for
writes using lmdd, so I don't see your difference.  If I just write to the
gnbd device, and don't sync the device after the writes I get a write speed of
54.3400 MB/sec.  This only takes into account how fast gnbd can write out all
the requests, not how long it takes for them to actually get written to disk
on the server. I would be suspicious of a gnbd write speed that is higher than
the served device's speed.

-Ben 

> First I thougt, it might be related to the strange dm-setup I was 
> running, and therefore I
> tried it with gnbd-exporting and importing just a single block device 
> (without lvm and dm)
> but the problem remains...
> 
> Do I have misconfiguerd something completly (I am using GBEth bonding 
> devices) or can you or anybody else confirm the
> behavior of much better write than read performance?
> I was testing with RHEL4 2.6.9-6.38.EL
> 
> Thank you for your help and your great work...
> 
> Greetings from a rainy morning  in Munich
> 
> 
> Hansj?rg
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Benjamin Marzinski wrote:
> 
> >On Fri, Apr 15, 2005 at 04:24:21PM +0200, Hansjoerg.Maurer at dlr.de wrote:
> > 
> >
> >>Hi
> >>
> >>I found a solution for the problem descriped below,
> >>but I am not sure if it is the right way.
> >>
> >>- importing the two gnbd's (wich point to the same device) from two 
> >>servers
> >>-> /dev/gnbd0 and /dev/gnbd1 on the client
> >>
> >>- creating a multipath device with something like this:
> >>echo "0 167772160  multipath 0 0 1 1 round-robin 0 2 1 251:0 1000 251:1 
> >>1000 " | dmsetup create dm0
> >>(251:0 ist the major:minor id of /dev/gnbd0)
> >>
> >>- mounting the created device
> >>eg:
> >>mount -t gfs /dev/mapper/dm0 /mnt/lvol0
> >>
> >>If I do a write on /mnt/lvol0 the gnbd_server task on both gnbd_servers 
> >>start (with a noticeable speedup)
> >>
> >>If one gnbd_server fails dm removes that path with the following log
> >>kernel: device-mapper: dm-multipath: Failing path 251:0.
> >>
> >>I was able to add it again with
> >>
> >>dmsetup  message dm0 0  reinstate_path 251:0
> >>
> >>
> >>I was able to deactivate a path manually with
> >>
> >>dmsetup  message dm0 0  fail_path 251:0 
> >>
> >>But I can not unimport the underlying gnbd
> >>
> >>gnbd_import: ERROR cannot disconnect device #1 : Device or resource busy
> >>
> >>
> >>Is there a way to remove a gnbd, which is bunndled in a dm-multipath 
> >>device?
> >>(might be necessary, if one gnbd server must be rebooted)
> >>
> >>How can I reimport an gnbd on the client in state disconnected?
> >>(I had to manually start 
> >>gnbd_recvd -d 0  to do so)
> >>
> >>Is the descriped solution for gnbd multipath the right one?
> >>   
> >>
> >
> >Um... It's a really ugly one. Unfortunately it's the only one that works, 
> >since
> >multipath-tools do not currently support non-scsi devices.
> >
> >There are also some bugs in gnbd that make multipathing even more annoying.
> >
> >But to answer your question, in order to remove gnbd, you must first get it
> >out of the multipath table, otherwise dm-multipath will still have it open.
> >
> >To do this, after dmsetup status shows that the path is failed, you run:
> >
> ># echo "0 167772160  multipath 0 0 1 1 round-robin 0 1 1 251:1 1000 " | 
> >dmsetup reload dm0
> ># dmsetup resume dm0
> >
> >This removes the gnbd from the path.
> >
> >However, if you use the gnbd code from the cvs head, it is no longer 
> >necessary
> >to do this to reimport the device.  In the stable branch, gnbd_monitor 
> >waits
> >until all users close the device before setting it to restartable. In the 
> >head
> >code, this happens as soon as the device is successfully fenced. So, if you
> >loose a gnbd server, reboot it, and reexport the device, gnbd_monitor 
> >should
> >automatically reimport the device, and you can simply run 
> >
> ># dmsetup  message dm0 0  reinstate_path 251:0
> >
> >and you should never need to remove the gnbd device with the method I 
> >described
> >above.
> >
> >
> >-Ben
> >
> > 
> >
> >>Thank you very much
> >>
> >>Greetings from munich
> >>
> >>Hansj?rg
> >>
> >>
> >>
> >>
> >>
> >>
> >>   
> >>
> >>>Hi
> >>>
> >>>I am trying to set up gnbd with multipath.
> >>>Accoding to the gnbd_usage.txt file, I understand, that this should work 
> >>>with
> >>>dm-multipath.
> >>>But unfortunatly only the gfs part of the setup is descriped there.
> >>>
> >>>Has anybody experiance with this setup, especially how to set up
> >>>multipath with multiple /dev/gnbd* and how to setup the multipath.conf 
> >>>file
> >>>
> >>>
> >>>Thank you very much
> >>>
> >>>Hansj?rg Maurer
> >>>     
> >>>
> >>-- 
> >>_________________________________________________________________
> >>
> >>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
> >>                               |
> >>Deutsches Zentrum               | DLR Oberpfaffenhofen
> >> f. Luft- und Raumfahrt e.V.   |
> >>Institut f. Robotik             |
> >>Postfach 1116                   | Muenchner Strasse 20
> >>82230 Wessling                  | 82234 Wessling
> >>Germany                         |
> >>                               |
> >>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
> >>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
> >>__________________________________________________________________
> >>
> >>
> >>There are 10 types of people in this world, 
> >>those who understand binary and those who don't.
> >>
> >>
> >>
> >>
> >>--
> >>Linux-cluster mailing list
> >>Linux-cluster at redhat.com
> >>http://www.redhat.com/mailman/listinfo/linux-cluster
> >>   
> >>
> >
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >http://www.redhat.com/mailman/listinfo/linux-cluster
> >
> >
> > 
> >
> 
> 
> -- 
> _________________________________________________________________
> 
> Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>                                |
> Deutsches Zentrum               | DLR Oberpfaffenhofen
>  f. Luft- und Raumfahrt e.V.   |
> Institut f. Robotik             |
> Postfach 1116                   | Muenchner Strasse 20
> 82230 Wessling                  | 82234 Wessling
> Germany                         |
>                                |
> Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
> Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
> __________________________________________________________________
> 
> 
> There are 10 types of people in this world, 
> those who understand binary and those who don't.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From lhh at redhat.com  Tue Apr 19 16:44:52 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 19 Apr 2005 12:44:52 -0400
Subject: [Linux-cluster] clusterfs.sh: misleading description of
	parameter 'options'
In-Reply-To: <4264EC59.8040009@birger.sh>
References: <4264EC59.8040009@birger.sh>
Message-ID: <1113929093.20618.260.camel@ayanami.boston.redhat.com>

On Tue, 2005-04-19 at 13:32 +0200, birger wrote:
> The description of the perameter 'options' in clusterfs.sh talks about 
> doing file system check... It seems to be useable for setting any kind 
> of mount option, so the description is misleading. Same for fs.sh. 
> netfs.sh is ok.

Whoops!  I'll fix that...

-- Lon



From birger at birger.sh  Tue Apr 19 18:47:49 2005
From: birger at birger.sh (birger)
Date: Tue, 19 Apr 2005 20:47:49 +0200
Subject: [Linux-cluster] How to set up NFS HA service
In-Reply-To: <1113926314.20618.258.camel@ayanami.boston.redhat.com>
References: <426502C2.7030803@birger.sh>
	<1113926314.20618.258.camel@ayanami.boston.redhat.com>
Message-ID: <42655255.3000106@birger.sh>

I think my first attempt to answer ended up in the bit bucket because of a 
wlan problem while I saved it to the drafts folder. Sigh...

Lon Hohberger wrote:

> On Tue, 2005-04-19 at 15:08 +0200, birger wrote:
> 
> Known bug/feature:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=151669
> 
> You can change this behavior if you wanted to by adding <child type=...>
> to service.sh's "special" element in the XML meta-data.

I thought about trying just that, but believed it couldn't be that simple... :-D


>>I'm also a bit puzzled about why the file systems don't get unmounted 
>>when I disable all services.
> 
> 
> They're GFS.  Add force_unmount="1" to the <fs> elements if you want
> them to be umounted.  GFS is nice because you *don't* have to umount
> it.

That was exactly why I wanted to mount the gfs file systems outside the 
service. I am very happy with this unexpected behaviour. I want the file 
systems to be there. :-)

I was afraid they didn't unmount because of some problem.

> FYI, NFS services on traditional file systems don't cleanly stop right
> now due to an EBUSY during umount from the kernel.  Someone's looking in
> to it on the NFS side (apparently, not all the refs are getting cleared
> if a node has an NFS mount ref and we unexport the FS, or something).

I saw a very similar problem some years ago on Solaris with Veritas 
FirstWatch. fuser and lofs came up empty, but still the file system was busy 
when I tried to umount. I found a workaround... Restarting statd and lockd 
and then umount. Seems like they had their paws in the file system somehow.
Since FirstWatch was mostly a bunch of sh scripts it was easy to modify the 
nfs umount code to do this.

Regarding lockd, I think my solution is valid given the 2 restraints:
- The cluster nodes should not be NFS clients (and thanks to GFS I don't 
need that)
- There should only be one NFS service running on any cluster node. And I 
only have one NFS service.

When I set the name for statd to the name of the service IP address and 
relocate the status dir to a cluster disk, a takeover should behave just 
like a server reboot, shouldn't it?

  >>Apr 19 14:42:58 server1 clurgmgrd[7498]: <notice> Service nfssvc started
>>Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
>>Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)
>>Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
>>Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)
> 
> 
> Hmm, that's odd, it could be a bug in the status phase which is related
> to NIS exports.  Does this only happen after a failover, or does it
> happen all the time?

My cluster only has one node (even if I have defined 2 nodes). I have to get 
the first node production ready and migrate everything over first. Then make 
the old file server a second cluster node.

I'll have a look around and see if I can find a solution.

-- 
birger



From alex at DSRLab.com  Tue Apr 19 18:24:07 2005
From: alex at DSRLab.com (Alex Vrenios)
Date: Tue, 19 Apr 2005 11:24:07 -0700
Subject: [Linux-cluster] getting started
In-Reply-To: <426507DB.2080702@plus.net>
Message-ID: <200504191924.j3JJOFco023434@mx1.redhat.com>

> Hi,
> I'm getting started with GFS but I cannot find documentation 
> to get a cluster running. Is there anything I could read? Thanks.
> --
> Sergio Rua 

Have a look at
http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/cluster-suite/
particularly Appendix B and C.




From lhh at redhat.com  Tue Apr 19 19:37:58 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 19 Apr 2005 15:37:58 -0400
Subject: [Linux-cluster] How to set up NFS HA service
In-Reply-To: <42655255.3000106@birger.sh>
References: <426502C2.7030803@birger.sh>
	<1113926314.20618.258.camel@ayanami.boston.redhat.com>
	<42655255.3000106@birger.sh>
Message-ID: <1113939478.20618.269.camel@ayanami.boston.redhat.com>

On Tue, 2005-04-19 at 20:47 +0200, birger wrote:

> Regarding lockd, I think my solution is valid given the 2 restraints:
> - The cluster nodes should not be NFS clients (and thanks to GFS I don't 
> need that)
> - There should only be one NFS service running on any cluster node. And I 
> only have one NFS service.

Ah, ok, this might work then.  I've never tested anything quite like it.
One thing to note is that when you take an NFS lock on GFS file systems,
the lock will exist on the other cluster node too.  This is because NFS
can export the same GFS filesystem on multiple nodes.  I'm not sure what
would happen during the lock-reclaim grace period if you tried to
relocate a service while a client still had locks (since the locks would
exist on both nodes...).

Ken, any idea on this?

> When I set the name for statd to the name of the service IP address and 
> relocate the status dir to a cluster disk, a takeover should behave just 
> like a server reboot, shouldn't it?

In principle, that's all a failover/relocation should ever look like to
clients.

> My cluster only has one node (even if I have defined 2 nodes). I have to get 
> the first node production ready and migrate everything over first. Then make 
> the old file server a second cluster node.
> 
> I'll have a look around and see if I can find a solution.

Ok, it could be a bug in the status code when looking for NIS exports.
I'll try to take a look at it this week.

-- Lon



From lhh at redhat.com  Tue Apr 19 21:19:09 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Tue, 19 Apr 2005 17:19:09 -0400
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
In-Reply-To: <42635F81.60801@uib.no>
References: <425E2180.6060609@birger.sh>
	<1113577167.20618.155.camel@ayanami.boston.redhat.com>
	<42635F81.60801@uib.no>
Message-ID: <1113945549.20618.275.camel@ayanami.boston.redhat.com>

Sorry, missed this one...

On Mon, 2005-04-18 at 09:19 +0200, Birger Wathne wrote:

> The big surprise for me was that ifconfig and exportfs don't show the IP 
> address and exports set up by the cluster, but at least the ip certainly 
> works.

/sbin/ip addr list <-- should show the IPs

exportfs should show the exports.  If it's not, there may be something
broken.  I'll have a look.


>  My problem now is that I get permission denied when mounting on the 
> clients, and the logfile on the server says the clients are unknown. Seems 
> like it isn't resolving them, as they are listed only with ip address in the 
> log. Or could it be that I cannot use <nfsexport> without a <fs>?

<nfsexport> just sets up a path for <nfsclient> to inherit. ;)

The thought here is that clients can be reused.  Exports don't get set
up unless there are clients.

> How would I normally go about having a gfs file system mounted at boot. 
> Create a service bound to server1 that mounts it? Or can it be put in 
> /etc/fstab? smbsvc is supposed to operate on the same file system, so I want 
> the file system to always be there independent of the nfssvc and smbsvc.

Use fstab.  Putting GFS file systems in rgmanager's config section just
causes it to make sure it's mounted.

-- Lon



From birger at uib.no  Tue Apr 19 22:19:53 2005
From: birger at uib.no (Birger Wathne)
Date: Wed, 20 Apr 2005 00:19:53 +0200
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
In-Reply-To: <1113945549.20618.275.camel@ayanami.boston.redhat.com>
References: <425E2180.6060609@birger.sh>	<1113577167.20618.155.camel@ayanami.boston.redhat.com>	<42635F81.60801@uib.no>
	<1113945549.20618.275.camel@ayanami.boston.redhat.com>
Message-ID: <42658409.3000005@uib.no>

Lon Hohberger wrote:

> /sbin/ip addr list <-- should show the IPs

Yes, I found that one by reading ip.sh

>exportfs should show the exports.  If it's not, there may be something
>broken.  I'll have a look.
>  
>
The only thing broken was that I could not have exports without wrapping 
them in <fs> (or <clusterfs>)

> Use fstab. Putting GFS file systems in rgmanager's config section just
>
>causes it to make sure it's mounted.
>  
>
Since I couldn't export them without also mounting them (or at least 
check that they are mounted) in the service I currently don't have them 
in fstab.

-- 
birger



From Hansjoerg.Maurer at dlr.de  Wed Apr 20 05:38:19 2005
From: Hansjoerg.Maurer at dlr.de (=?ISO-8859-15?Q?Hansj=F6rg_Maurer?=)
Date: Wed, 20 Apr 2005 07:38:19 +0200
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <20050419161931.GE8789@phlogiston.msp.redhat.com>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>	<20050418214338.GC8789@phlogiston.msp.redhat.com>	<4264A742.5020508@dlr.de>
	<20050419161931.GE8789@phlogiston.msp.redhat.com>
Message-ID: <4265EACB.5060305@dlr.de>

Hi,

thank you for yor reply.

Benjamin Marzinski wrote:

>On Tue, Apr 19, 2005 at 08:37:54AM +0200, Hansjoerg Maurer wrote:
>  
>
>>The most anoying point is for me at the moment the differnence between 
>>gnbd read and write performance.
>>Therefore I am glad, that you as a gnbd-developer answered...
>>In my tests, gnbd write is about two to three times faster the gnbd reads.
>>I tried a lot of things (exporting cached, changing readahead with 
>>blockdev command (on the underlying device), changing TCP-IP buffersizes)
>>but I had nor improvement.
>>
>>In the upper example, I get a write speed of about 85MB/s over gnbd and 
>>a read speed of about 26 MB/s .
>>(the underlying device's sda and sdb manages about 50MB/s (read and write).
>>Therefore read speed is very good....
>>    
>>
>
>Um. On my machines I get 41.6707 MB/sec for reads and 41.6805 MB/sec for
>writes using lmdd, so I don't see your difference.  If I just write to the
>gnbd device, and don't sync the device after the writes I get a write speed of
>54.3400 MB/sec.  This only takes into account how fast gnbd can write out all
>the requests, not how long it takes for them to actually get written to disk
>on the server. I would be suspicious of a gnbd write speed that is higher than
>the served device's speed.
>
>-Ben 
>  
>
the device speed of one disk is about 50 MB/s, but the two gnbd's (zwo 
seperate disks) are bundeled to one striped lvm.
Therefore the speed is possible.
I have seen the difference in read and write speed on two installations.
Kernel 2.6.9-6.38.EL and gfs from RHEL4 cvs , tested with cfq and 
anticipatory elevator.
What kernel are you using?

Thank you very much

Hansj?rg

>  
>
>  
>


-- 
_________________________________________________________________

Dr.  Hansjoerg Maurer           | LAN- & System-Manager
                                |
Deutsches Zentrum               | DLR Oberpfaffenhofen
  f. Luft- und Raumfahrt e.V.   |
Institut f. Robotik             |
Postfach 1116                   | Muenchner Strasse 20
82230 Wessling                  | 82234 Wessling
Germany                         |
                                |
Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
__________________________________________________________________


There are 10 types of people in this world, 
those who understand binary and those who don't.



From srua at plus.net  Wed Apr 20 08:40:23 2005
From: srua at plus.net (Sergio Rua)
Date: Wed, 20 Apr 2005 09:40:23 +0100
Subject: [Linux-cluster] getting started
In-Reply-To: <200504191924.j3JJOFco023434@mx1.redhat.com>
References: <200504191924.j3JJOFco023434@mx1.redhat.com>
Message-ID: <42661577.6070106@plus.net>

Alex,

>>I'm getting started with GFS but I cannot find documentation 
>>    
>>
>Have a look at
>http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/cluster-suite/
>particularly Appendix B and C.
>  
>
Great, exactly what I was looking for. Thanks a lot.

-- 
Sergio Rua 



From phillips at redhat.com  Wed Apr 20 09:06:38 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Wed, 20 Apr 2005 09:06:38 +0000
Subject: [Linux-cluster] Configuration of a 2 node HA cluster with gfs
In-Reply-To: <425E2180.6060609@birger.sh>
References: <425E2180.6060609@birger.sh>
Message-ID: <200504200906.38427.phillips@redhat.com>

On Thursday 14 April 2005 07:53, birger wrote:
> I have a lot of experience with Veritas FirstWatch and some with
> SunCluster, so I am not new to HA services. Now I have this server that
> I have to get up and running as quickly as possible... And I find little
> documentation about how to get this software up and running from a fresh
> install.
>
> I have one old file server with external scsi disks, and one new server
> with a SCSI-attached Nexsan ATAboy RAID array.
>
> I want to set up the new file server as half of a 2-node cluster and get
> it into production. Then move over data (and disks) from the old server
> until I can reinstall that one as the second cluster node.

You need a cluster mirror.  See cluster/cmirror on sourceware.org.  
Alternatively, you could run an order 0 ddraid array, which is equivalent 
to a cluster mirror, but caters to the non-shared disk case as well.   

Actually, your case amounts to a shared disk since you will not export any 
storage from the smaller node, as I understand it.  However, DDRaid will 
work fine, and should you decide to export storage from the second node as 
well, it will just work, giving you an easy upgrade path.

That is the good news.  Now I must temper it with: ... !DDRaid is still 
pre-alpha! ... Don't expect it to make any sense for a production machine 
for at least the next few months.  See cluster/ddraid for more information.

(I will demonstrate ddraid running on five or ten nodes of a 12 node 
distributed data cluster tomorrow afternoon here in Canberra, so it is not 
_that_ far away.)

Regards,

Daniel



From jbrassow at redhat.com  Wed Apr 20 01:38:56 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Tue, 19 Apr 2005 20:38:56 -0500
Subject: [Linux-cluster] getting started
In-Reply-To: <426507DB.2080702@plus.net>
References: <426507DB.2080702@plus.net>
Message-ID: <a8ea399cf9e4b148e85d5736539856b1@redhat.com>

sources.redhat.com/cluster  Try looking under documentation.

Kind Regards,
  brassow

On Apr 19, 2005, at 8:30 AM, Sergio Rua wrote:

> Hi,
>
> I'm getting started with GFS but I cannot find documentation to get a
> cluster running. Is there anything I could read? Thanks.
>
> -- 
> Sergio Rua
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster
>



From pbruna at linuxcenterla.com  Wed Apr 20 15:25:59 2005
From: pbruna at linuxcenterla.com (Patricio Bruna V.)
Date: Wed, 20 Apr 2005 11:25:59 -0400
Subject: [Linux-cluster] GFS 6
Message-ID: <200504201125.59565.pbruna@linuxcenterla.com>

any option to use gulm with 2 nodes in reduntant way?
or use gdlm with GFS 6?
-- 
Patricio Bruna
pbruna at linuxcenterla.com
Red Hat Certified Engineer
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050420/d8a29ccc/attachment.sig>

From bmarzins at redhat.com  Wed Apr 20 17:07:36 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Wed, 20 Apr 2005 12:07:36 -0500
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <4265EACB.5060305@dlr.de>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>
	<20050418214338.GC8789@phlogiston.msp.redhat.com>
	<4264A742.5020508@dlr.de>
	<20050419161931.GE8789@phlogiston.msp.redhat.com>
	<4265EACB.5060305@dlr.de>
Message-ID: <20050420170736.GH8789@phlogiston.msp.redhat.com>

On Wed, Apr 20, 2005 at 07:38:19AM +0200, Hansj?rg Maurer wrote:
> Hi,
> 
> thank you for yor reply.
> 
> Benjamin Marzinski wrote:
> 
> >On Tue, Apr 19, 2005 at 08:37:54AM +0200, Hansjoerg Maurer wrote:
> > 
> >
> >>The most anoying point is for me at the moment the differnence between 
> >>gnbd read and write performance.
> >>Therefore I am glad, that you as a gnbd-developer answered...
> >>In my tests, gnbd write is about two to three times faster the gnbd reads.
> >>I tried a lot of things (exporting cached, changing readahead with 
> >>blockdev command (on the underlying device), changing TCP-IP buffersizes)
> >>but I had nor improvement.
> >>
> >>In the upper example, I get a write speed of about 85MB/s over gnbd and 
> >>a read speed of about 26 MB/s .
> >>(the underlying device's sda and sdb manages about 50MB/s (read and 
> >>write).
> >>Therefore read speed is very good....
> >>   
> >>
> >
> >Um. On my machines I get 41.6707 MB/sec for reads and 41.6805 MB/sec for
> >writes using lmdd, so I don't see your difference.  If I just write to the
> >gnbd device, and don't sync the device after the writes I get a write 
> >speed of
> >54.3400 MB/sec.  This only takes into account how fast gnbd can write out 
> >all
> >the requests, not how long it takes for them to actually get written to 
> >disk
> >on the server. I would be suspicious of a gnbd write speed that is higher 
> >than
> >the served device's speed.
> >
> >-Ben 
> > 
> >
> the device speed of one disk is about 50 MB/s, but the two gnbd's (zwo 
> seperate disks) are bundeled to one striped lvm.
> Therefore the speed is possible.
> I have seen the difference in read and write speed on two installations.
> Kernel 2.6.9-6.38.EL and gfs from RHEL4 cvs , tested with cfq and 
> anticipatory elevator.
> What kernel are you using?

I got those numbers on a vanilla 2.6.11 kernel.
Running on a 2.6.9-6.24.EL kernel, I get
45.6608 MB/sec writes 
35.8578 MB/sec reads

Not as pronouced as your numbers, but a noticeable speed difference.

-Ben
> Thank you very much
> 
> Hansj?rg
> 
> > 
> >
> > 
> >
> 
> 
> -- 
> _________________________________________________________________
> 
> Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>                                |
> Deutsches Zentrum               | DLR Oberpfaffenhofen
>  f. Luft- und Raumfahrt e.V.   |
> Institut f. Robotik             |
> Postfach 1116                   | Muenchner Strasse 20
> 82230 Wessling                  | 82234 Wessling
> Germany                         |
>                                |
> Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
> Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
> __________________________________________________________________
> 
> 
> There are 10 types of people in this world, 
> those who understand binary and those who don't.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From pcaulfie at redhat.com  Wed Apr 20 18:02:49 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Wed, 20 Apr 2005 19:02:49 +0100
Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel
	(2.6.12rc2)
In-Reply-To: <4263692F.5010906@fabbione.net>
References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>	<20050418075623.GB6015@tykepenguin.com>
	<4263692F.5010906@fabbione.net>
Message-ID: <42669949.3000105@redhat.com>

I've checked in the cman & dlm fixes to the FC4 branch (mainly because
they are needed for the latest Fedora kernels!).

cvs HEAD will be having some major upheaval soon as I move cman to
userspace, so there's little point in doing it there too...

Patrick



From fabbione at fabbione.net  Wed Apr 20 18:09:57 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Wed, 20 Apr 2005 20:09:57 +0200
Subject: [Linux-cluster] [PATCH] Fix usage of sk_alloc in cman-kernel
	(2.6.12rc2)
In-Reply-To: <42669949.3000105@redhat.com>
References: <20050416083344.1BCE02B9F@trider-g7.fabbione.net>	<20050418075623.GB6015@tykepenguin.com>	<4263692F.5010906@fabbione.net>
	<42669949.3000105@redhat.com>
Message-ID: <42669AF5.1090700@fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Patrick Caulfield wrote:
> I've checked in the cman & dlm fixes to the FC4 branch (mainly because
> they are needed for the latest Fedora kernels!).
> 
> cvs HEAD will be having some major upheaval soon as I move cman to
> userspace, so there's little point in doing it there too...
> 
> Patrick

It is of course more than fine by me :)

Thanks for the feedback
Fabio
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCZprzhCzbekR3nhgRAgMKAJ0dpSTRbyDktIG5E9BSKlq9f3hoFQCfWAtg
ohEEjAZDRjv360WQ2Y5qgBg=
=wu7G
-----END PGP SIGNATURE-----



From pbruna at linuxcenterla.com  Wed Apr 20 18:39:16 2005
From: pbruna at linuxcenterla.com (Patricio Bruna V.)
Date: Wed, 20 Apr 2005 14:39:16 -0400
Subject: [Linux-cluster] GFS 7?
Message-ID: <200504201439.17171.pbruna@linuxcenterla.com>

when will all this new stuff be available for RH4, or will be oficcialy 
supported by RedHat. any dates?
-- 
Patricio Bruna
pbruna at linuxcenterla.com
Red Hat Certified Engineer
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050420/f9693ce2/attachment.sig>

From Hansjoerg.Maurer at dlr.de  Thu Apr 21 05:58:49 2005
From: Hansjoerg.Maurer at dlr.de (=?ISO-8859-15?Q?Hansj=F6rg_Maurer?=)
Date: Thu, 21 Apr 2005 07:58:49 +0200
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <20050420170736.GH8789@phlogiston.msp.redhat.com>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>	<20050418214338.GC8789@phlogiston.msp.redhat.com>	<4264A742.5020508@dlr.de>	<20050419161931.GE8789@phlogiston.msp.redhat.com>	<4265EACB.5060305@dlr.de>
	<20050420170736.GH8789@phlogiston.msp.redhat.com>
Message-ID: <42674119.4090605@dlr.de>

Hi

thank you very much,
Do you have any idea what causes the speed difference between vanilla 
and RHEL4 kernel?
Should I officially file a bug in bugzilla, or do you take care of the 
problem?

Greetings


Hansj?rg





>I got those numbers on a vanilla 2.6.11 kernel.
>Running on a 2.6.9-6.24.EL kernel, I get
>45.6608 MB/sec writes 
>35.8578 MB/sec reads
>
>Not as pronouced as your numbers, but a noticeable speed difference.
>
>-Ben
>  
>
>>Thank you very much
>>
>>Hansj?rg
>>
>>    
>>
>>>
>>>
>>>      
>>>
>>-- 
>>_________________________________________________________________
>>
>>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>>                               |
>>Deutsches Zentrum               | DLR Oberpfaffenhofen
>> f. Luft- und Raumfahrt e.V.   |
>>Institut f. Robotik             |
>>Postfach 1116                   | Muenchner Strasse 20
>>82230 Wessling                  | 82234 Wessling
>>Germany                         |
>>                               |
>>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
>>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
>>__________________________________________________________________
>>
>>
>>There are 10 types of people in this world, 
>>those who understand binary and those who don't.
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>    
>>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>
>  
>

-- 
_________________________________________________________________

Dr.  Hansjoerg Maurer           | LAN- & System-Manager
                                |
Deutsches Zentrum               | DLR Oberpfaffenhofen
  f. Luft- und Raumfahrt e.V.   |
Institut f. Robotik             |
Postfach 1116                   | Muenchner Strasse 20
82230 Wessling                  | 82234 Wessling
Germany                         |
                                |
Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
__________________________________________________________________


There are 10 types of people in this world, 
those who understand binary and those who don't.



From fabbione at fabbione.net  Thu Apr 21 06:48:50 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Thu, 21 Apr 2005 08:48:50 +0200
Subject: [Linux-cluster] Problems after upgrade :(
In-Reply-To: <20050418144812.GH6015@tykepenguin.com>
References: <20050418131110.780E43B21F0@poczta.interia.pl>
	<20050418144812.GH6015@tykepenguin.com>
Message-ID: <42674CD2.9050609@fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Patrick Caulfield wrote:
> On Mon, Apr 18, 2005 at 03:11:10PM +0200, ptr at poczta.fm wrote:
> 
>>    Hello.
>>
>>I installed the newest CVS version from the scratch
>>(because of noticed problems with old libs and binaries
>>remaining in system directories even after "make install" 
>>with certain components. 
>>Anyways, now I'm getting those entries as below after node startup
>>and even during normal work:
> 
> 
> Head of CVS is not a good thing to use. Checkout the RHEL4 branch instead.
> 
> 

I think we will need to investigate this problem sooner or later. I can reproduce it
here too and it might as well affect Fedora 4 kernels.

Fabio

- --
Self-Service law:
The last available dish of the food you have decided to eat, will be
inevitably taken from the person in front of you.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCZ0zQhCzbekR3nhgRArTyAJ9FTDFN78NHQqh+NZbgDywFmCnc9ACfVnri
6KySDi77iX/OkXGKH3uxCjE=
=FHxy
-----END PGP SIGNATURE-----



From pcaulfie at redhat.com  Thu Apr 21 09:22:34 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 21 Apr 2005 10:22:34 +0100
Subject: [Linux-cluster] cman goes to userland
Message-ID: <426770DA.9010007@redhat.com>

This morning I check in new userland cman code onto the head of CVS.

cman_tool & libcman on head now refer to this cman rather than the
kernel one.

If you want to continue using the kernel-based cman then you should
checkout the RHEL4 or FC4 branches of CVS.

Patrick



From fabbione at fabbione.net  Thu Apr 21 09:23:40 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Thu, 21 Apr 2005 11:23:40 +0200 (CEST)
Subject: [Linux-cluster] [PATCH] libgulm not hadled correctly?
Message-ID: <20050421092340.CD9B8305B@trider-g7.fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everybody,

while running an automated test to verify library integrity and consistency,
I noticed that libgulm.a was not an ar archive as it is supposed to be and
that libgulm.so.$(RELEASE_MAJOR) amd libgulm.so where not symlinks, but
plain copy of the library.

I am not sure if this was just an overlook or if it was intended to be that way.

The following patch does what I think is the right thing.

Please review before applying!

Thanks
Fabio M. Di Nitto

- --- gulm/Makefile.orig	2005-04-21 11:09:08.000000000 +0200
+++ gulm/Makefile	2005-04-21 11:21:30.507016896 +0200
@@ -80,16 +80,15 @@
 #######################################
 # build rules
 
- -all: src/lock_gulmd src/gulm_tool lib/libgulm.so lib/libgulm.so.$(RELEASE_MAJOR)
+all: src/lock_gulmd src/gulm_tool lib/libgulm.so lib/libgulm.so.$(RELEASE_MAJOR) lib/libgulm.a
 
 src/lock_gulmd: $(gulmd_src:.c=.o) 
 	${CC} ${CFLAGS} ${LDFLAGS} $^ ${LDLIBS} -lccs -o $@
 
 lib/libgulm.a: $(lib_src:.c=.o)
- -	${LD} ${LDFLAGS} -r $^ -o $@ --retain-symbols-file=lib/exported_symbols.sym
- -	#xdr_* functions still `visable'. for relocation i think.
+	${AR} cr $@ $^
 
- -lib/libgulm.so.$(RELEASE_MAJOR).$(RELEASE_MINOR): lib/libgulm.a
+lib/libgulm.so.$(RELEASE_MAJOR).$(RELEASE_MINOR): $(lib_src:.c=.o)
 	${LD} -shared -soname libgulm.so.$(RELEASE_MAJOR) -o $@ $^ -lc
 
 lib/libgulm.so.$(RELEASE_MAJOR): lib/libgulm.so.$(RELEASE_MAJOR).$(RELEASE_MINOR)
@@ -121,8 +120,8 @@
 	install -d ${libdir}
 	install -m644 lib/libgulm.a ${libdir}
 	install -m644 lib/libgulm.so.$(RELEASE_MAJOR).$(RELEASE_MINOR) ${libdir}
- -	install -m644 lib/libgulm.so.$(RELEASE_MAJOR) ${libdir}
- -	install -m644 lib/libgulm.so ${libdir}
+	cd $(libdir); ln -snf libgulm.so.$(RELEASE_MAJOR).$(RELEASE_MINOR) libgulm.so.$(RELEASE_MAJOR); cd -
+	cd $(libdir); ln -snf libgulm.so.$(RELEASE_MAJOR).$(RELEASE_MINOR) libgulm.so; cd -
 	install -d ${incdir}
 	install -m644 lib/libgulm.h ${incdir}
 	cd init.d && ${MAKE} install
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iQIVAwUBQmdw7FA6oBJjVJ+OAQLKIBAAiVkNZjBP7s8Oree6AtCrbiM6WbZPs7iH
ABXCZjLcjnIZUz2sJPCuOcQnULsUm6g3505gC/fLxf/0H9WbZwLjNtnYXKwdxgzW
oVvUwqw0qJNR1+fStO7Z/QLY8YeGII8qYtAap6RE/TpAwjh0LtSwYrXapMw8UWlS
lT0GdtTZWAAab4ri3pMmW81hd+SZ/1aYeD38XcF3UesHL4PkClCJtPXjFGin5Phg
WUG93W0gFVVAZwIykd+iixO7kByvUz8P+1TKUOqF/II0jyWhd2nSidkel0YUmSMQ
JhmMTMJcBewPtcUOG0pbKC6aoD7bV+8FAXxYOwSGZ0jNl0O1H6djbX3tveTC0C0n
aQJqIlP4G41MPeky3cKlayRJqKPunNI5ZAWun4YqMb1jp958gQRVigH0SEz47xkT
M1l78qH07/fH8FmqvMKrUuf9kjRhaDjJKdlfphL4vyQWJlFC5U6m3XjamgS75qCP
HTMbgQi+jEEL/bVaTeX+VI0Jp79OCSTkHMZiie6p2zI6lb6jKBJv8IicjVaUQmgY
X0vR2veYiNC9Sxd/ppO6xewZpW3szQsOYKxJxzHXhnoKKklFC8Tz8QL446Je+iih
L+72Oxq3QUBkEip4oIVuahgUmFOVmQ4AFrCpqpkQ4JMSX7AqnM7OVDHKAK9Gy92V
Fpe1FFaJ4tU=
=gUYv
-----END PGP SIGNATURE-----



From mtilstra at redhat.com  Thu Apr 21 13:52:08 2005
From: mtilstra at redhat.com (Michael Conrad Tadpol Tilstra)
Date: Thu, 21 Apr 2005 08:52:08 -0500
Subject: [Linux-cluster] [PATCH] libgulm not hadled correctly?
In-Reply-To: <20050421092340.CD9B8305B@trider-g7.fabbione.net>
References: <20050421092340.CD9B8305B@trider-g7.fabbione.net>
Message-ID: <20050421135208.GA13363@redhat.com>

On Thu, Apr 21, 2005 at 11:23:40AM +0200, Fabio Massimo Di Nitto wrote:
> while running an automated test to verify library integrity and consistency,
> I noticed that libgulm.a was not an ar archive as it is supposed to be and
> that libgulm.so.$(RELEASE_MAJOR) amd libgulm.so where not symlinks, but
> plain copy of the library.
> 
> I am not sure if this was just an overlook or if it was intended to be
> that way.
> 
> The following patch does what I think is the right thing.
> 
> Please review before applying!

Um, wild guess, but this is on CVS HEAD?
assuming so, since that's the only branch where I forgot to fix the
symlinks to the .so
(Now that there are four branches, it would be nice if you could say
which branch your patch is on.)

I'll get that checked in.

As for using ld instead of ar to build the .a, well I copied that from
somewhere... i thought from the magma libs, but I see that they are using
ar, so I duno... I'll switch that too then.

Thanks for the patch.

-- 
Michael Conrad Tadpol Tilstra
Oh yeah, anybody got an ethernet card to fit my night table?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050421/4cd9b8ba/attachment.sig>

From fabbione at fabbione.net  Thu Apr 21 14:23:35 2005
From: fabbione at fabbione.net (Fabio Massimo Di Nitto)
Date: Thu, 21 Apr 2005 16:23:35 +0200
Subject: [Linux-cluster] [PATCH] libgulm not hadled correctly?
In-Reply-To: <20050421135208.GA13363@redhat.com>
References: <20050421092340.CD9B8305B@trider-g7.fabbione.net>
	<20050421135208.GA13363@redhat.com>
Message-ID: <4267B767.9090005@fabbione.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Conrad Tadpol Tilstra wrote:
> On Thu, Apr 21, 2005 at 11:23:40AM +0200, Fabio Massimo Di Nitto wrote:
> 
>>while running an automated test to verify library integrity and consistency,
>>I noticed that libgulm.a was not an ar archive as it is supposed to be and
>>that libgulm.so.$(RELEASE_MAJOR) amd libgulm.so where not symlinks, but
>>plain copy of the library.
>>
>>I am not sure if this was just an overlook or if it was intended to be
>>that way.
>>
>>The following patch does what I think is the right thing.
>>
>>Please review before applying!
> 
> 
> Um, wild guess, but this is on CVS HEAD?
> assuming so, since that's the only branch where I forgot to fix the
> symlinks to the .so
> (Now that there are four branches, it would be nice if you could say
> which branch your patch is on.)

Yes sorry.. I forgot about that, but it was against CVS head.

I will take more care next time

Fabio
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCZ7dmhCzbekR3nhgRApP9AJ4/nZrHa0+KJS/iQylN0dUmRzLMDQCePSMj
CsdI5XqkyBdPHCUzPDh7RjU=
=knYZ
-----END PGP SIGNATURE-----



From lhh at redhat.com  Thu Apr 21 17:35:59 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Thu, 21 Apr 2005 13:35:59 -0400
Subject: [Linux-cluster] cman goes to userland
In-Reply-To: <426770DA.9010007@redhat.com>
References: <426770DA.9010007@redhat.com>
Message-ID: <1114104959.20618.325.camel@ayanami.boston.redhat.com>

On Thu, 2005-04-21 at 10:22 +0100, Patrick Caulfield wrote:
> This morning I check in new userland cman code onto the head of CVS.
> 
> cman_tool & libcman on head now refer to this cman rather than the
> kernel one.
> 
> If you want to continue using the kernel-based cman then you should
> checkout the RHEL4 or FC4 branches of CVS.

Hi Patrick,

We'll need to add a magma plugin for this (or replace the existing one
in HEAD) in order to make things like CCS, gnbd, rgmanager, et. al.
start working again.

Has the DLM interface changed as well?

-- Lon



From pcaulfie at redhat.com  Thu Apr 21 18:25:51 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Thu, 21 Apr 2005 19:25:51 +0100
Subject: [Linux-cluster] cman goes to userland
In-Reply-To: <1114104959.20618.325.camel@ayanami.boston.redhat.com>
References: <426770DA.9010007@redhat.com>
	<1114104959.20618.325.camel@ayanami.boston.redhat.com>
Message-ID: <4267F02F.3060307@redhat.com>

Lon Hohberger wrote:
> 
> We'll need to add a magma plugin for this (or replace the existing one
> in HEAD) in order to make things like CCS, gnbd, rgmanager, et. al.
> start working again.

It should be pretty easy. the "new" cman is really just a port of the
old one.


> Has the DLM interface changed as well?
> 
No. not at all.

All my existing DLM test programs run quite happily (without a
recompile) on either.

-- 

patrick



From CAugustine at overlandstorage.com  Thu Apr 21 18:59:15 2005
From: CAugustine at overlandstorage.com (CAugustine at overlandstorage.com)
Date: Thu, 21 Apr 2005 11:59:15 -0700
Subject: [Linux-cluster] What versions of the linux, cluster,
 dm and lvm2 tend to work well togther???
Message-ID: <OF4B54D2CE.F68F2644-ON88256FEA.00666B17-88256FEA.00684CC3@myoverland.net>

Hi Everyone,

I am running with the 2.6.10-1.770_FC3CRAW.root linux kernel,
which I built it to support raw devices.

I downloaded the CVS sources for RHEL4 branch of the cluster 
and the LVM2, and the device-mapper sources are from the HEAD 
of the CVS tree. I built and installed them on both of my linux systems. 

After rebooting:

The ccsd command is executed successfully on both nodes.
The "cman_tool join" command executed on both nodes eventually
shows the msg:          Cluster is quorate.  Allowing connections.

However, looking at the /proc/cluster/nodes files on both systems,
I see the following:

        Node  Votes Exp Sts  Name
        1    1    1   M   ovn01

        Node  Votes Exp Sts  Name
        2    1    1   M   ovn02

Why on ovn02 node, Node=2 and on ovn01 Node=1?
Has each node created a cluster on its own? Some sort of brain split???

Then, I ran "fence_tool join" successfully on both systems. 
However, the "clvmd" cmd gave me the following messages on 
both systems:
 
         Apr 20 10:06:09 localhost clvmd: Unable to create lockspace for 
                                             CLVM: Inappropriate ioctl for 
device
         Apr 20 10:06:10 localhost clvmd: Cannot login in to CCSD server

With the above kernel, what versions of the cluster, dm and LVM2 
are more likely to work well? Can anyone point me to ok versions of
linux, dm, cluster and LVM2 rpms/srpms that work well together???

Any comments from you are greatly appreciated....

Thanks,
Caroline



----------------------------------------------------------------------------------------------
See our award-winning line of tape and disk-based
backup & recovery solutions at http://www.overlandstorage.com
----------------------------------------------------------------------------------------------

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050421/abaf4413/attachment.htm>

From CAugustine at overlandstorage.com  Thu Apr 21 20:29:40 2005
From: CAugustine at overlandstorage.com (CAugustine at overlandstorage.com)
Date: Thu, 21 Apr 2005 13:29:40 -0700
Subject: [Linux-cluster] Re: Linux-cluster Digest, Vol 12, Issue 17
Message-ID: <OF8F5E8746.6735CD27-ON88256FEA.006FB632-88256FEA.0070945C@myoverland.net>

Hi Everyone,

I have compiled and installed the CVS RHEL4 branch of cluster 
and the CVS HEAD sources for dm and lvm2 with my FC3 2.6.9-1.667
linux kernel. All seems to be working ok (i.e. ccsd; cman_tool join; 
fence_tool join)
until I run "clvmd" which errors out with the following messages:

        clvmd could not connect to cluster manager
        Consult syslog for more information

Looking at /var/log/messages I see the messages:

        Apr 20 11:42:45 ovn02 clvmd: Cannot login in to CCSD server
        Apr 20 11:48:50 ovn02 clvmd: Unable to create lockspace for CLVM: 
Inappropriate ioctl for device

Running clvmd with strace (i.e. strace -o xx -ff clvmd -d) I see the 
following calls:

        open("/dev/misc/dlm_control, O_RDWR) return 5
        ioctl(5, 0x400044401, 0x8084b84) return -1 with errno=ENOTTY

I cannot figure out what the above ioctl command is and why it is 
rejected. 
Can anyone shed some light on this?

Thanks,
Caroline
 

----------------------------------------------------------------------------------------------
See our award-winning line of tape and disk-based
backup & recovery solutions at http://www.overlandstorage.com
----------------------------------------------------------------------------------------------

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050421/4a86d0a5/attachment.htm>

From pcaulfie at redhat.com  Fri Apr 22 07:31:46 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 22 Apr 2005 08:31:46 +0100
Subject: [Linux-cluster] What versions of the linux, cluster, dm and lvm2
	tend to work well togther???
In-Reply-To: <OF4B54D2CE.F68F2644-ON88256FEA.00666B17-88256FEA.00684CC3@myoverland.net>
References: <OF4B54D2CE.F68F2644-ON88256FEA.00666B17-88256FEA.00684CC3@myoverland.net>
Message-ID: <4268A862.3010005@redhat.com>

CAugustine at overlandstorage.com wrote:
> 
> Hi Everyone,
> 
> I am running with the 2.6.10-1.770_FC3CRAW.root linux kernel,
> which I built it to support raw devices.
> 
> I downloaded the CVS sources for RHEL4 branch of the cluster
> and the LVM2, and the device-mapper sources are from the HEAD
> of the CVS tree. I built and installed them on both of my linux systems.
> 
> After rebooting:
> 
> The ccsd command is executed successfully on both nodes.
> The "cman_tool join" command executed on both nodes eventually
> shows the msg:                 Cluster is quorate.  Allowing connections.
> 
> However, looking at the /proc/cluster/nodes files on both systems,
> I see the following:
> 
>         Node  Votes Exp Sts  Name
>            1    1    1   M   ovn01
> 
>         Node  Votes Exp Sts  Name
>            2    1    1   M   ovn02
> 
> Why on ovn02 node, Node=2 and on ovn01 Node=1?
> Has each node created a cluster on its own? Some sort of brain split???

Yep that's a split brain. If expected_votes was set (correctly) to 2
then you wouldn't have quorum. /never/ set expected_votes to 1 on a real
system. "cman_tool join" will set it to the total number of nodes in
ccsd so check you have the same custer.conf file on both nodes.

I worry that your syslog still names the host "localhost", that might
easily be a network misconfiguration which would cause nodes not to be
able to see each other.

> Then, I ran "fence_tool join" successfully on both systems.
> However, the "clvmd" cmd gave me the following messages on
> both systems:
> 
>         Apr 20 10:06:09 localhost clvmd: Unable to create lockspace for
>                                              CLVM: Inappropriate ioctl
> for device

Are you sure this is up-to-date ? creating a lockspace hasn't been an
ioctl operation for months now.
-- 

patrick



From lhh at redhat.com  Fri Apr 22 13:48:17 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 22 Apr 2005 09:48:17 -0400
Subject: [Linux-cluster] cman goes to userland
In-Reply-To: <4267F02F.3060307@redhat.com>
References: <426770DA.9010007@redhat.com>
	<1114104959.20618.325.camel@ayanami.boston.redhat.com>
	<4267F02F.3060307@redhat.com>
Message-ID: <1114177697.20618.434.camel@ayanami.boston.redhat.com>

On Thu, 2005-04-21 at 19:25 +0100, Patrick Caulfield wrote:

> It should be pretty easy. the "new" cman is really just a port of the
> old one.

... except the old plugin predates libcman (by a long time), and uses
the ioctl() methods of talking to the kernel ;)

I don't think it'll be much work.  Just making sure I post to the
mailing list so I remember to do it.  I'm more likely to remember when
there are 100 people reminding me, you know? :)

-- Lon



From pcaulfie at redhat.com  Fri Apr 22 13:51:41 2005
From: pcaulfie at redhat.com (Patrick Caulfield)
Date: Fri, 22 Apr 2005 14:51:41 +0100
Subject: [Linux-cluster] cman goes to userland
In-Reply-To: <1114177697.20618.434.camel@ayanami.boston.redhat.com>
References: <426770DA.9010007@redhat.com>	<1114104959.20618.325.camel@ayanami.boston.redhat.com>	<4267F02F.3060307@redhat.com>
	<1114177697.20618.434.camel@ayanami.boston.redhat.com>
Message-ID: <4269016D.50002@redhat.com>

Lon Hohberger wrote:
> On Thu, 2005-04-21 at 19:25 +0100, Patrick Caulfield wrote:
> 
> 
>>It should be pretty easy. the "new" cman is really just a port of the
>>old one.
> 
> 
> ... except the old plugin predates libcman (by a long time), and uses
> the ioctl() methods of talking to the kernel ;)
>

Yes, but even the old libcman was just a wrapper round the ioctls, so
the concepts are the same.
-- 

patrick



From bmarzins at redhat.com  Fri Apr 22 15:36:46 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Fri, 22 Apr 2005 10:36:46 -0500
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <42674119.4090605@dlr.de>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>
	<20050418214338.GC8789@phlogiston.msp.redhat.com>
	<4264A742.5020508@dlr.de>
	<20050419161931.GE8789@phlogiston.msp.redhat.com>
	<4265EACB.5060305@dlr.de>
	<20050420170736.GH8789@phlogiston.msp.redhat.com>
	<42674119.4090605@dlr.de>
Message-ID: <20050422153646.GK8789@phlogiston.msp.redhat.com>

On Thu, Apr 21, 2005 at 07:58:49AM +0200, Hansj?rg Maurer wrote:
> Hi
> 
> thank you very much,
> Do you have any idea what causes the speed difference between vanilla 
> and RHEL4 kernel?
> Should I officially file a bug in bugzilla, or do you take care of the 
> problem?

Go ahead and file a bug on gnbd if you are interested in tracking the status
of this. I'll put it on my list of things to do.

-Ben
 
> Greetings
> 
> 
> Hansj?rg
> 
> 
> 
> 
> 
> >I got those numbers on a vanilla 2.6.11 kernel.
> >Running on a 2.6.9-6.24.EL kernel, I get
> >45.6608 MB/sec writes 
> >35.8578 MB/sec reads
> >
> >Not as pronouced as your numbers, but a noticeable speed difference.
> >
> >-Ben
> > 
> >
> >>Thank you very much
> >>
> >>Hansj?rg
> >>
> >>   
> >>
> >>>
> >>>
> >>>     
> >>>
> >>-- 
> >>_________________________________________________________________
> >>
> >>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
> >>                              |
> >>Deutsches Zentrum               | DLR Oberpfaffenhofen
> >>f. Luft- und Raumfahrt e.V.   |
> >>Institut f. Robotik             |
> >>Postfach 1116                   | Muenchner Strasse 20
> >>82230 Wessling                  | 82234 Wessling
> >>Germany                         |
> >>                              |
> >>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
> >>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
> >>__________________________________________________________________
> >>
> >>
> >>There are 10 types of people in this world, 
> >>those who understand binary and those who don't.
> >>
> >>--
> >>Linux-cluster mailing list
> >>Linux-cluster at redhat.com
> >>http://www.redhat.com/mailman/listinfo/linux-cluster
> >>   
> >>
> >
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >http://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > 
> >
> 
> -- 
> _________________________________________________________________
> 
> Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>                                |
> Deutsches Zentrum               | DLR Oberpfaffenhofen
>  f. Luft- und Raumfahrt e.V.   |
> Institut f. Robotik             |
> Postfach 1116                   | Muenchner Strasse 20
> 82230 Wessling                  | 82234 Wessling
> Germany                         |
>                                |
> Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
> Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
> __________________________________________________________________
> 
> 
> There are 10 types of people in this world, 
> those who understand binary and those who don't.
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From hansjoerg.maurer at dlr.de  Fri Apr 22 18:45:33 2005
From: hansjoerg.maurer at dlr.de (=?ISO-8859-1?Q?Hansj=F6rg_Maurer?=)
Date: Fri, 22 Apr 2005 20:45:33 +0200
Subject: [Linux-cluster] AW: GNBD multipath with devicemapper? -- possible
	solution
In-Reply-To: <20050422153646.GK8789@phlogiston.msp.redhat.com>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>	<20050418214338.GC8789@phlogiston.msp.redhat.com>	<4264A742.5020508@dlr.de>	<20050419161931.GE8789@phlogiston.msp.redhat.com>	<4265EACB.5060305@dlr.de>	<20050420170736.GH8789@phlogiston.msp.redhat.com>	<42674119.4090605@dlr.de>
	<20050422153646.GK8789@phlogiston.msp.redhat.com>
Message-ID: <4269464D.8080106@dlr.de>

Hi

I did further tests and I am not sure if it is gnbd related.
I did some tests with local ext3 and local gfs filesystem
on rhel4   2.6.9-6.36 and 2.6.9-6.38 with current cvs (RHEL4 tag).
ext3 reads and writes 2000MB in about 13-14s on our new testhardware.
gfs writes it in about 15s and needs about 40 s for reading (even with 
lock_nolock filesystem)

Maybe you can try this on an local RHEL4 gfs filesystem to, in order to 
be sure,
that my initially gnbd problem did not lead you to the wrong direction 
and the problem is gfs related.

nice weekend from munich

hansj?rg





Benjamin Marzinski wrote:

>On Thu, Apr 21, 2005 at 07:58:49AM +0200, Hansj?rg Maurer wrote:
>  
>
>>Hi
>>
>>thank you very much,
>>Do you have any idea what causes the speed difference between vanilla 
>>and RHEL4 kernel?
>>Should I officially file a bug in bugzilla, or do you take care of the 
>>problem?
>>    
>>
>
>Go ahead and file a bug on gnbd if you are interested in tracking the status
>of this. I'll put it on my list of things to do.
>
>-Ben
> 
>  
>
>>Greetings
>>
>>
>>Hansj?rg
>>
>>
>>
>>
>>
>>    
>>
>>>I got those numbers on a vanilla 2.6.11 kernel.
>>>Running on a 2.6.9-6.24.EL kernel, I get
>>>45.6608 MB/sec writes 
>>>35.8578 MB/sec reads
>>>
>>>Not as pronouced as your numbers, but a noticeable speed difference.
>>>
>>>-Ben
>>>
>>>
>>>      
>>>
>>>>Thank you very much
>>>>
>>>>Hansj?rg
>>>>
>>>>  
>>>>
>>>>        
>>>>
>>>>>    
>>>>>
>>>>>          
>>>>>
>>>>-- 
>>>>_________________________________________________________________
>>>>
>>>>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>>>>                             |
>>>>Deutsches Zentrum               | DLR Oberpfaffenhofen
>>>>f. Luft- und Raumfahrt e.V.   |
>>>>Institut f. Robotik             |
>>>>Postfach 1116                   | Muenchner Strasse 20
>>>>82230 Wessling                  | 82234 Wessling
>>>>Germany                         |
>>>>                             |
>>>>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
>>>>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
>>>>__________________________________________________________________
>>>>
>>>>
>>>>There are 10 types of people in this world, 
>>>>those who understand binary and those who don't.
>>>>
>>>>--
>>>>Linux-cluster mailing list
>>>>Linux-cluster at redhat.com
>>>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>>>  
>>>>
>>>>        
>>>>
>>>--
>>>Linux-cluster mailing list
>>>Linux-cluster at redhat.com
>>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>>
>>>      
>>>
>>-- 
>>_________________________________________________________________
>>
>>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>>                               |
>>Deutsches Zentrum               | DLR Oberpfaffenhofen
>> f. Luft- und Raumfahrt e.V.   |
>>Institut f. Robotik             |
>>Postfach 1116                   | Muenchner Strasse 20
>>82230 Wessling                  | 82234 Wessling
>>Germany                         |
>>                               |
>>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
>>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
>>__________________________________________________________________
>>
>>
>>There are 10 types of people in this world, 
>>those who understand binary and those who don't.
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>    
>>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>
>  
>



From tpcollier at liberty.edu  Fri Apr 22 20:44:25 2005
From: tpcollier at liberty.edu (Collier, Tirus)
Date: Fri, 22 Apr 2005 16:44:25 -0400
Subject: [Linux-cluster] GFS 6.0 on Kernel 2.4.xx
Message-ID: <5002B7C6DF422D48A2C8E1CF6397B1DD03ADE1CB@doc.University.liberty.edu>

Good Day All,

Request to know if anyone has any experience of successfully
installing/configuring a GFS 6.0 cluster on a RHL 2.4.xx kernel.

Please advise, thanks.

Tirus Collier
Liberty University
Systems Administration
1971 University Blvd.
Lynchburg, VA 24502
434.582.2822





From pbruna at linuxcenterla.com  Sat Apr 23 14:42:44 2005
From: pbruna at linuxcenterla.com (Patricio Bruna V.)
Date: Sat, 23 Apr 2005 10:42:44 -0400
Subject: [Linux-cluster] GFS 6.0 on Kernel 2.4.xx
In-Reply-To: <5002B7C6DF422D48A2C8E1CF6397B1DD03ADE1CB@doc.University.liberty.edu>
References: <5002B7C6DF422D48A2C8E1CF6397B1DD03ADE1CB@doc.University.liberty.edu>
Message-ID: <200504231042.50329.pbruna@linuxcenterla.com>

El Vie 22 Abr 2005 16:44, Collier, Tirus escribi?:
> Good Day All,
>
> Request to know if anyone has any experience of successfully
> installing/configuring a GFS 6.0 cluster on a RHL 2.4.xx kernel.
>
> Please advise, thanks.
>
i had not trouble to make it work. GFS 6.0, Cluster Suite and Red Hat AS 3.
but its not so cool like the new model on fedora devel
-- 
Patricio Bruna
pbruna at linuxcenterla.com
RHCE/RHCI
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050423/c12ddaeb/attachment.sig>

From hansjoerg.maurer at dlr.de  Sun Apr 24 13:46:01 2005
From: hansjoerg.maurer at dlr.de (=?ISO-8859-1?Q?Hansj=F6rg_Maurer?=)
Date: Sun, 24 Apr 2005 15:46:01 +0200
Subject: bad (2 times) gfs read performance under x86_64 compared to i386
	WAS: Re: [Linux-cluster] AW: GNBD multipath with devicemapper? --
	possible solution
In-Reply-To: <20050422153646.GK8789@phlogiston.msp.redhat.com>
References: <4CE5177FBED2784FAC715DB5553BD8970A3F3E@exbe04.intra.dlr.de>	<20050418214338.GC8789@phlogiston.msp.redhat.com>	<4264A742.5020508@dlr.de>	<20050419161931.GE8789@phlogiston.msp.redhat.com>	<4265EACB.5060305@dlr.de>	<20050420170736.GH8789@phlogiston.msp.redhat.com>	<42674119.4090605@dlr.de>
	<20050422153646.GK8789@phlogiston.msp.redhat.com>
Message-ID: <426BA319.1080805@dlr.de>

Hi

I have done some futher testing on this issue, and I noticed an 
interesting behavior.
We have an installation with RHEL4 on i386 hardware (old SAN) and one under
new x86_64 (dual CPU each hyptherthreading).

On i386 read performance is better than write performance (under ext3 
and gfs)
On x86_64 read performance is better the write performance under ext3
but two times bader under gfs (Filsystem created with lock_nolock)

Here a short summary:
i386 ext3
write:    1:16
read:    0:55
read:    0:43 (blockdev --setra 8192)

i386 ext3
write:    1:30
read:    1:08
read:    1:08 (blockdev --setra 8192)


x86_64 ext3
write:    0:35
read:    0:27
read:    0:18 (blockdev --setra 8192)

x86_64 gfs
write:    0:27
read:    1:03
read:    1:03 (blockdev --setra 8192)


The gnbd tests I did last week were under x86_64 to, so that might not 
be an gnbd issue, but an gfs issue under
x86_64.
The tests where done under
 2.6.9-5.0.3.ELsmp and  2.6.9-6.38.ELsmp with current GFS from  CVS 
(RHEL4 TAG) with no difference.

Can anyone reproduce this behavior?

Greetings

Hansj?rg


Here the detailed tests


[root at chianti sda1]# uname -a
Linux chianti.itsd.de 2.6.9-6.38.EL #1 Wed Apr 13 01:36:09 EDT 2005 i686 
athloni386 GNU/Linux


[root at chianti ~]# mkfs.ext3 /dev/sda1
mke2fs 1.35 (28-Feb-2004)
max_blocks 4294967295, rsv_groups = 0, rsv_gdb = 1024
Dateisystem-Label=
OS-Typ: Linux
Blockgr???e=4096 (log=2)
Fragmentgr???e=4096 (log=2)
5341184 Inodes, 10673176 Bl??cke
533658 Bl??cke (5.00%) reserviert f??r den Superuser
erster Datenblock=0
Maximum filesystem blocks=12582912
326 Blockgruppen
32768 Bl??cke pro Gruppe, 32768 Fragmente pro Gruppe
16384 Inodes pro Gruppe
Superblock-Sicherungskopien gespeichert in den Bl??cken:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
2654208,
        4096000, 7962624

Schreibe Inode-Tabellen: erledigt
inode.i_blocks = 98312, i_size = 4243456
Erstelle Journal (8192 Bl??cke): erledigt
Schreibe Superbl??cke und Dateisystem-Accountinginformationen: erledigt

Das Dateisystem wird automatisch alle 28 Mounts bzw. alle 180 Tage 
??berpr??ft,
je nachdem, was zuerst eintritt. Ver??nderbar mit tune2fs -c oder -t .



[root at chianti ~]# mount /dev/sda1 /mnt/sda1/
[root at chianti ~]# cd /mnt/sda1
[root at chianti sda1]# free
             total       used       free     shared    buffers     cached
Mem:        515736     164504     351232          0      43544      25104
-/+ buffers/cache:      95856     419880
Swap:       786424        152     786272


[root at chianti sda1]# time mkfile 1000M a

real    1m16.242s
user    0m0.009s
sys     0m16.273s
[root at chianti sda1]# time mkfile 1000M b

real    1m17.738s
user    0m0.009s
sys     0m15.511s
[root at chianti sda1]# time cat a > /dev/null

real    0m55.025s
user    0m0.193s
sys     0m6.529s

[root at chianti sda1]# time cat b > /dev/null

real    0m54.241s
user    0m0.220s
sys     0m6.331s


[root at chianti sda1]# blockdev --setra 8192 /dev/sda1
[root at chianti sda1]# time cat a > /dev/null

real    0m43.565s
user    0m0.189s
sys     0m6.014s

[root at chianti sda1]# time cat b > /dev/null

real    0m43.214s
user    0m0.196s
sys     0m6.727s

[root at chianti ~]# gfs_mkfs -p lock_nolock -j 3 /dev/sda1
This will destroy any data on /dev/sda1.
  It appears to contain a EXT2/3 filesystem.

Are you sure you want to proceed? [y/n] y
Device:                    /dev/sda1
Blocksize:                 4096
Filesystem Size:           10573560
Journals:                  3
Resource Groups:           162
Locking Protocol:          lock_nolock
Lock Table:

Syncing...
All Done

[root at chianti sda1]# time mkfile 1000M a

real    1m27.421s
user    0m0.010s
sys     0m12.202s
[root at chianti sda1]# time mkfile 1000M b

real    1m35.009s
user    0m0.006s
sys     0m12.513s
[root at chianti sda1]# time cat a > /dev/null

real    1m12.609s
user    0m0.153s
sys     0m9.980s
[root at chianti sda1]# time cat b > /dev/null

real    1m7.989s
user    0m0.154s
sys     0m10.427s


[root at chianti sda1]# blockdev --setra 256 /dev/sda1
[root at chianti sda1]# time cat a > /dev/null

real    1m8.402s
user    0m0.082s
sys     0m8.841s
[root at chianti sda1]# time cat b > /dev/null

real    1m8.647s
user    0m0.124s
sys     0m9.565s
[root at chianti sda1]# blockdev --setra 8192 /dev/sda1
[root at chianti sda1]# time cat a > /dev/null

real    1m8.419s
user    0m0.115s
sys     0m9.262s



[root at rmvbs02 ~]# uname -a
Linux rmvbs02.cluster.robotic.dlr.de 2.6.9-5.0.3.ELsmp #1 SMP Sat Feb 19 
15:45:14 CST 2005 x86_64 x86_64 x86_64 GNU/Linux


[root at rmvbs02 ~]# mkfs.ext3 /dev/sdb1
mke2fs 1.35 (28-Feb-2004)
max_blocks 4294967295, rsv_groups = 131072, rsv_gdb = 977
Dateisystem-Label=
OS-Typ: Linux
Blockgr???e=4096 (log=2)
Fragmentgr???e=4096 (log=2)
97615872 Inodes, 195221872 Bl??cke
9761093 Bl??cke (5.00%) reserviert f??r den Superuser
erster Datenblock=0
Maximum filesystem blocks=4294967296
5958 Blockgruppen
32768 Bl??cke pro Gruppe, 32768 Fragmente pro Gruppe
16384 Inodes pro Gruppe
Superblock-Sicherungskopien gespeichert in den Bl??cken:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Schreibe Inode-Tabellen: erledigt
inode.i_blocks = 140696, i_size = 4243456
Erstelle Journal (8192 Bl??cke): erledigt
Schreibe Superbl??cke und Dateisystem-Accountinginformationen: erledigt

Das Dateisystem wird automatisch alle 38 Mounts bzw. alle 180 Tage 
??berpr??ft,
je nachdem, was zuerst eintritt. Ver??nderbar mit tune2fs -c oder -t .


[root at rmvbs02 tmp]# free
             total       used       free     shared    buffers     cached
Mem:       1026364      65960     960404          0        684      15544
-/+ buffers/cache:      49732     976632
Swap:      2739040       1360    2737680
[root at rmvbs02 tmp]# time mkfile 2000M a

real    0m35.560s
user    0m0.005s
sys     0m7.596s
[root at rmvbs02 tmp]# time mkfile 2000M b

real    0m30.917s
user    0m0.005s
sys     0m7.707s
[root at rmvbs02 tmp]# time mkfile 2000M c

real    0m40.663s
user    0m0.010s
sys     0m7.806s
[root at rmvbs02 tmp]# time cat a > /dev/null

real    0m28.787s
user    0m0.122s
sys     0m2.818s
[root at rmvbs02 tmp]# time cat b > /dev/null

real    0m27.468s
user    0m0.124s
sys     0m2.678s
[root at rmvbs02 tmp]# blockdev --getra /dev/sdb1
128
[root at rmvbs02 tmp]# blockdev --setra 8192 /dev/sdb1
[root at rmvbs02 tmp]# time cat c > /dev/null

real    0m18.541s
user    0m0.105s
sys     0m2.064s
[root at rmvbs02 tmp]# time cat a > /dev/null

real    0m18.464s
user    0m0.117s
sys     0m2.035s


[root at rmvbs02 ~]# gfs_mkfs -p lock_nolock -j 3 /dev/sdb1
This will destroy any data on /dev/sdb1.
  It appears to contain a EXT2/3 filesystem.

Are you sure you want to proceed? [y/n] y

Device:                    /dev/sdb1
Blocksize:                 4096
Filesystem Size:           195108656
Journals:                  3
Resource Groups:           2978
Locking Protocol:          lock_nolock
Lock Table:

Syncing...
All Done
[root at rmvbs02 ~]# mount -t gfs /dev/sdb1 /mnt/tmp/
[root at rmvbs02 ~]# cd /mnt/tmp/
[root at rmvbs02 tmp]# time mkfile 2000M a

real    0m24.616s
user    0m0.004s
sys     0m4.892s
[root at rmvbs02 tmp]# time mkfile 2000M b

real    0m27.023s
user    0m0.005s
sys     0m5.027s
[root at rmvbs02 tmp]# time mkfile 2000M c

real    0m29.205s
user    0m0.005s
sys     0m5.163s


[root at rmvbs02 tmp]# time cat a > /dev/null

real    1m4.698s
user    0m0.120s
sys     0m6.138s
[root at rmvbs02 tmp]# time cat b > /dev/null

real    1m2.958s
user    0m0.132s
sys     0m6.175s
[root at rmvbs02 tmp]# time cat c > /dev/null

real    1m2.867s
user    0m0.109s
sys     0m6.079s
[root at rmvbs02 tmp]# blockdev --getra  /dev/sdb1
8192
[root at rmvbs02 tmp]# blockdev --setra 256  /dev/sdb1
[root at rmvbs02 tmp]# time cat a > /dev/null

real    1m2.931s
user    0m0.101s
sys     0m6.073s





Benjamin Marzinski wrote:

>On Thu, Apr 21, 2005 at 07:58:49AM +0200, Hansj?rg Maurer wrote:
>  
>
>>Hi
>>
>>thank you very much,
>>Do you have any idea what causes the speed difference between vanilla 
>>and RHEL4 kernel?
>>Should I officially file a bug in bugzilla, or do you take care of the 
>>problem?
>>    
>>
>
>Go ahead and file a bug on gnbd if you are interested in tracking the status
>of this. I'll put it on my list of things to do.
>
>-Ben
> 
>  
>
>>Greetings
>>
>>
>>Hansj?rg
>>
>>
>>
>>
>>
>>    
>>
>>>I got those numbers on a vanilla 2.6.11 kernel.
>>>Running on a 2.6.9-6.24.EL kernel, I get
>>>45.6608 MB/sec writes 
>>>35.8578 MB/sec reads
>>>
>>>Not as pronouced as your numbers, but a noticeable speed difference.
>>>
>>>-Ben
>>>
>>>
>>>      
>>>
>>>>Thank you very much
>>>>
>>>>Hansj?rg
>>>>
>>>>  
>>>>
>>>>        
>>>>
>>>>>    
>>>>>
>>>>>          
>>>>>
>>>>-- 
>>>>_________________________________________________________________
>>>>
>>>>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>>>>                             |
>>>>Deutsches Zentrum               | DLR Oberpfaffenhofen
>>>>f. Luft- und Raumfahrt e.V.   |
>>>>Institut f. Robotik             |
>>>>Postfach 1116                   | Muenchner Strasse 20
>>>>82230 Wessling                  | 82234 Wessling
>>>>Germany                         |
>>>>                             |
>>>>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
>>>>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
>>>>__________________________________________________________________
>>>>
>>>>
>>>>There are 10 types of people in this world, 
>>>>those who understand binary and those who don't.
>>>>
>>>>--
>>>>Linux-cluster mailing list
>>>>Linux-cluster at redhat.com
>>>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>>>  
>>>>
>>>>        
>>>>
>>>--
>>>Linux-cluster mailing list
>>>Linux-cluster at redhat.com
>>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>>
>>>      
>>>
>>-- 
>>_________________________________________________________________
>>
>>Dr.  Hansjoerg Maurer           | LAN- & System-Manager
>>                               |
>>Deutsches Zentrum               | DLR Oberpfaffenhofen
>> f. Luft- und Raumfahrt e.V.   |
>>Institut f. Robotik             |
>>Postfach 1116                   | Muenchner Strasse 20
>>82230 Wessling                  | 82234 Wessling
>>Germany                         |
>>                               |
>>Tel: 08153/28-2431              | E-mail: Hansjoerg.Maurer at dlr.de
>>Fax: 08153/28-1134              | WWW: http://www.robotic.dlr.de/
>>__________________________________________________________________
>>
>>
>>There are 10 types of people in this world, 
>>those who understand binary and those who don't.
>>
>>--
>>Linux-cluster mailing list
>>Linux-cluster at redhat.com
>>http://www.redhat.com/mailman/listinfo/linux-cluster
>>    
>>
>
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
>
>  
>



From rajkum2002 at rediffmail.com  Mon Apr 25 17:53:43 2005
From: rajkum2002 at rediffmail.com (Raj  Kumar)
Date: 25 Apr 2005 17:53:43 -0000
Subject: [Linux-cluster] Out of Memory Problem
Message-ID: <20050425175343.18257.qmail@webmail29.rediffmail.com>

Hello,

I upgraded to the latest kernel and GFS version. Size-64 slab memory is growing continously and after sometime the system crashes with out of memory error. Since size-64 slab is growing the system is using lowmem region and when there is no free lowmem system explodes. 

The GFS filesystem is exported via NFS to user workstations. This configuration seems to be a problem since the memory is increasing only when users accesses the files via NFS exports. Is GFS+NFS stable yet?

Any advice is appreciated!
Thank you,
Raj


On Mon, 18 Apr 2005 Raj  Kumar wrote :
>Hi everone,
>
>cat /proc/slabinfo:
>
>size-64           4825410 4825410    128 160847 160847    1 : 1008  252
>
>This seems to be unusal... size-64 slab is consuming upto 643MB of RAM. This number seems to increase slowly... how to track which process is requesting the objects from this slab? Does anyone know if there is a bug related to this in RH 2.4.21-27.0.1.ELsmp kernel?
>
>Thank you,
>Raj
>
>
>
>On Mon, 18 Apr 2005 Raj  Kumar wrote :
> >Hi everyone,
> >
> >One of our GFS Linux servers has crashed twice yesterday. The log messages indicate the server ran out of memory and started killing processes:
> >
> >Out of Memory: Killed process 21188 (sshd).
> >Out of Memory: Killed process 5215 (xfs).
> >
> >The server is a HP DL380 with dual Xeon 3.06 GHz processor, 1GB RAM, 2GB swap space running RHEL 3.0- kernel 2.4.21-27.0.1.ELsmp. The server runs NIS, GFS, SSHD and samba services. After the first crash the server didn?t start due to file system corruption. The problem has been corrected and server returned to operation yesterday evening. Today's log indicates the server ran out of memory and killed processes again this morning. This out of memory problem is recurring speciallyl when users are accessing the storage mounted using GFS.
> >
> >Where can I start to debug the problem?
> >
> >free -m output:
> >
> >              total       used       free     shared    buffers     cached
> >Mem:          1001        986         14          0          1         79
> >-/+ buffers/cache:        905         95
> >Swap:         1996         49       1946
> >
> >I don't understand what's happening to the total 1GB memory. This is the free output that happened seconds before crash (I had swatch set up to log the statistics the moment it sees OOM messages). PS output doesn't show any process taking significant portion of memory either. Since this is happening only when users are using GFS heavily I suspect it is the problem. But how do I verify it? Is 1GB too small for a GFS server?
> >
> >I found that another user has seen the same problem before:
> >https://www.redhat.com/archives/linux-cluster/2005-January/msg00099.html
> >
> >GFS setup was fine and all our tests passed. We then moved it to production and it immediately failed after running for two days. Your help is very much appreciated!! The problem seems to be reproducible. So if you need any logs I can rerun what our users did at the time of crash.
> >
> >Thanks,
> >Raj
> >
> >================== Log ============================
> >
> >Apr  7 10:49:15 server1 kernel: Mem-info:
> >Apr  7 10:49:15 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
> >Apr  7 10:49:15 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
> >Apr  7 10:49:15 server1 kernel: Zone:HighMem freepages:   287 min:   255 low:   510 high:   765
> >Apr  7 10:49:15 server1 kernel: Free pages:        3461 (   287 HighMem)
> >Apr  7 10:49:15 server1 kernel: ( Active: 22389/6071, inactive_laundry: 889, inactive_clean: 943, free: 3461 )
> >Apr  7 10:49:15 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
> >Apr  7 10:49:15 server1 kernel:   aa:6 ac:13 id:292 il:43 ic:0 fr:382
> >Apr  7 10:49:15 server1 kernel:   aa:15159 ac:7211 id:5769 il:856 ic:943 fr:287
> >Apr  7 10:49:15 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:15 server1 kernel: 0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:15 server1 kernel: 33*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1148kB) Apr  7 10:49:15 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:15 server1 kernel: 218499 pages of slabcache Apr  7 10:49:15 server1 kernel: 216 pages of kernel stacks Apr  7 10:49:16 server1 kernel: 0 lowmem pagetables, 489 highmem pagetables
> >Apr  7 10:49:16 server1 kernel: Free swap:       2038872kB
> >Apr  7 10:49:16 server1 kernel: 262138 pages of RAM Apr  7 10:49:16 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:16 server1 kernel: 5780 reserved pages Apr  7 10:49:16 server1 kernel: 16752 pages shared Apr  7 10:49:16 server1 kernel: 485 pages swap cached Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
> >Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
> >Apr  7 10:49:20 server1 kernel: Mem-info:
> >Apr  7 10:49:20 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
> >Apr  7 10:49:20 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
> >Apr  7 10:49:20 server1 kernel: Zone:HighMem freepages:   291 min:   255 low:   510 high:   765
> >Apr  7 10:49:20 server1 kernel: Free pages:        3465 (   291 HighMem)
> >Apr  7 10:49:20 server1 kernel: ( Active: 21743/6636, inactive_laundry: 896, inactive_clean: 1049, free: 3465 )
> >Apr  7 10:49:20 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
> >Apr  7 10:49:20 server1 kernel:   aa:6 ac:36 id:265 il:40 ic:0 fr:382
> >Apr  7 10:49:20 server1 kernel:   aa:14479 ac:7222 id:6365 il:862 ic:1049 fr:291
> >Apr  7 10:49:20 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:20 server1 kernel: 28*4kB 1*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:20 server1 kernel: 37*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1164kB) Apr  7 10:49:20 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:21 server1 kernel: 218570 pages of slabcache Apr  7 10:49:21 server1 kernel: 196 pages of kernel stacks Apr  7 10:49:21 server1 kernel: 0 lowmem pagetables, 404 highmem pagetables
> >Apr  7 10:49:21 server1 kernel: Free swap:       2038872kB
> >Apr  7 10:49:21 server1 kernel: 262138 pages of RAM Apr  7 10:49:21 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:21 server1 kernel: 5780 reserved pages Apr  7 10:49:22 server1 kernel: 13904 pages shared Apr  7 10:49:22 server1 kernel: 485 pages swap cached Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
> >Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
> >.........
> >............
> >--
> >Linux-cluster mailing list
> >Linux-cluster at redhat.com
> >http://www.redhat.com/mailman/listinfo/linux-cluster
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050425/c3344161/attachment.htm>

From birger at birger.sh  Tue Apr 26 11:06:56 2005
From: birger at birger.sh (birger)
Date: Tue, 26 Apr 2005 13:06:56 +0200
Subject: [Linux-cluster] Problem with nfsclient.sh status checking
Message-ID: <426E20D0.6030509@birger.sh>

I have now looked into the error messages I get in /var/log/messages 
saying things like
Apr 26 13:03:08 server1 clurgmgrd[4969]: <notice> status on nfsclient 
"nis-hosts" returned 1 (generic error)

The problem is in nfsclient.sh. It uses the output from exportfs to 
check that file systems are still exported. When the mount point name is 
longer than 14 characters exportfs breaks the line. This breaks 
nfsclient.sh.

-- 
birger



From birger at uib.no  Tue Apr 26 11:22:21 2005
From: birger at uib.no (Birger Wathne)
Date: Tue, 26 Apr 2005 13:22:21 +0200
Subject: [Linux-cluster] Problem with nfsclient.sh status checking
In-Reply-To: <426E20D0.6030509@birger.sh>
References: <426E20D0.6030509@birger.sh>
Message-ID: <426E246D.90904@uib.no>

birger wrote:

> I have now looked into the error messages I get in /var/log/messages 
> saying things like
> Apr 26 13:03:08 server1 clurgmgrd[4969]: <notice> status on nfsclient 
> "nis-hosts" returned 1 (generic error)
>
> The problem is in nfsclient.sh. It uses the output from exportfs to 
> check that file systems are still exported. When the mount point name 
> is longer than 14 characters exportfs breaks the line. This breaks 
> nfsclient.sh.
>
A little more info... Whenever the status check failed the client got 
stale NFS handle for a little while. I guess the entry got unexported 
and reexported. Stale NFS handles are a bad thing. Hanging NFS service 
is better from a data integrity point of view.

The very quick and dirty fix follows. Edit status check in 
/usr/share/cluster/nfsclient.sh to look like this
        exportfs -v | tr -d "\n" | sed -e 's/([^)]*)/\n/g'| grep -q 
"^${OCF_RESKEY_path}[\t ]*.*${OCF_RESKEY_target}"

the -v option to exportfs is used to give an easy way to put line breaks 
in just where they are supposed to be.

-- 
birger




From pbruna at linuxcenterla.com  Wed Apr 27 13:53:19 2005
From: pbruna at linuxcenterla.com (Patricio Bruna V.)
Date: Wed, 27 Apr 2005 09:53:19 -0400
Subject: [Linux-cluster] GFS poor performance
Message-ID: <200504270953.24090.pbruna@linuxcenterla.com>

Its normal that GFS goes slower than ext3?
i have the following scenario:

2 Dell 2.4G Xeon (dual) 
2 HBA each
1 500GB Storage.

i have configured GFS 6 and make some test on it and ext3 its far faster.
-- 
Patricio Bruna
pbruna at linuxcenterla.com
RHCE/RHCI
Jefe Soporte y Operaciones LinuxCenter S.A.
Canada 239, 5to piso, Providencia, Chile
http://www.linuxcenterla.com +56-2-2745000
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050427/62c70220/attachment.sig>

From jbrassow at redhat.com  Wed Apr 27 15:49:25 2005
From: jbrassow at redhat.com (Jonathan E Brassow)
Date: Wed, 27 Apr 2005 10:49:25 -0500
Subject: [Linux-cluster] GFS poor performance
In-Reply-To: <200504270953.24090.pbruna@linuxcenterla.com>
References: <200504270953.24090.pbruna@linuxcenterla.com>
Message-ID: <64cfa246cf367cceda42b35d8c02aa46@redhat.com>

It really depends on what you are doing.

GFS, in it's current state, favors large I/O.  You can get very good 
results if separate machines are simultaneously doing large block I/O.

For now, think of it like this: 1) the larger the block I/O, the better 
the performance 2) simultaneous non-contending I/O from different 
machines increase throughput.

You are probably testing from one machine with smaller I/O - violating 
both 1 & 2 above.  In which case, ext3 - a file system designed to be 
local (as opposed to clustered) - should be faster.

That being said, there are a number of improvements coming in GFS that 
will greatly improve workloads like the one you are testing.

Also, there are HA advantages to consider.  Performance vs High 
Availability...  Hopefully, in the not to distant future, you will not 
have to choose, but will get both from GFS.

  brassow

One way to think about the current state of GFS is to compare it to 
thread programming.  If you add threading to your program, but it has 
to do alot of locking and only runs on one processor; you've just 
killed all your performance.  If you add a processor (or machines in 
GFS's case), it is possible that you will see speedup.  If you increase 
your work chunks (increase I/O size) and reduces locking (reduce 
contention), you will see great improvements.

On Apr 27, 2005, at 8:53 AM, Patricio Bruna V. wrote:

> Its normal that GFS goes slower than ext3?
> i have the following scenario:
>
> 2 Dell 2.4G Xeon (dual)
> 2 HBA each
> 1 500GB Storage.
>
> i have configured GFS 6 and make some test on it and ext3 its far 
> faster.
> -- 
> Patricio Bruna
> pbruna at linuxcenterla.com
> RHCE/RHCI
> Jefe Soporte y Operaciones LinuxCenter S.A.
> Canada 239, 5to piso, Providencia, Chile
> http://www.linuxcenterla.com +56-2-2745000
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From scd at broked.org  Thu Apr 28 07:36:45 2005
From: scd at broked.org (Steven Dake)
Date: Thu, 28 Apr 2005 00:36:45 -0700
Subject: [Linux-cluster] better integration between openais and redhat
	cluster suite (group and dlm)
Message-ID: <1114673805.11753.198.camel@slickdeal.broked.org>

Dave
find attached some patches to remove the requirement to use both the clm
and evs libraries for cluster/dlm and cluster/group openais
integration.  Instead a function is added to the evs library to
determine the local node identifier.  Give it a spin and let me know
what you think...

I also fixed up a few makefile errors where i couldn't compile the dlm
and group from CVS.

regards
-steve
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openais-getmemb.patch
Type: text/x-patch
Size: 16932 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050428/006d8fb8/attachment.bin>
-------------- next part --------------
? group/list.h
? group/daemon/groupd
? group/daemon/list2.h
Index: group/daemon/Makefile.openais
===================================================================
RCS file: /cvs/cluster/cluster/group/daemon/Makefile.openais,v
retrieving revision 1.1
diff -u -r1.1 Makefile.openais
--- group/daemon/Makefile.openais	19 Apr 2005 07:00:42 -0000	1.1
+++ group/daemon/Makefile.openais	28 Apr 2005 07:33:21 -0000
@@ -10,7 +10,7 @@
 ###############################################################################
 ###############################################################################
 
-CFLAGS+= -g -I. -I../../../openais/include/
+CFLAGS+= -g -I../include -I../../../openais/include/
 
 TARGET=groupd
 
@@ -23,8 +23,7 @@
 	update.o \
 	done.o \
 	recover.o \
-	../../../openais/lib/libevs.a \
-	../../../openais/lib/libSaClm.a
+	../../../openais/lib/libevs.a
 	$(CC) $(LDFLAGS) -o $@ $^
 
 main.o: main.c gd_internal.h
Index: group/daemon/openais.c
===================================================================
RCS file: /cvs/cluster/cluster/group/daemon/openais.c,v
retrieving revision 1.1
diff -u -r1.1 openais.c
--- group/daemon/openais.c	19 Apr 2005 07:00:42 -0000	1.1
+++ group/daemon/openais.c	28 Apr 2005 07:33:21 -0000
@@ -38,8 +38,6 @@
 
 #include "gd_internal.h"
 #include "evs.h"
-#include "ais_types.h"
-#include "saClm.h"
 
 extern struct list_head gd_nodes;
 extern int              gd_node_count;
@@ -231,42 +229,14 @@
 	return 0;
 }
 
-/* All this SaClm stuff just to get the local nodeid which isn't
-   available through the evs api. */
-
-void foo(SaInvocationT i, const SaClmClusterNodeT *node, SaAisErrorT error)
-{
-}
-
-void bar (const SaClmClusterNotificationBufferT *b, SaUint32T n, SaAisErrorT e)
-{
-}
-
-SaClmCallbacksT clm_callbacks = {
-	.saClmClusterNodeGetCallback = foo,
-	.saClmClusterTrackCallback = bar
-};
-
 int set_our_nodeid(void)
 {
-	SaVersionT version = { 'B', 1, 1 };
-	SaClmHandleT handle;
-	SaClmClusterNodeT node;
-	int rv;
-
-	rv = saClmInitialize(&handle, &clm_callbacks, &version);
-	if (rv != SA_OK) {
-		log_print("saClmInitialize error %d %d", rv, errno);
-		return rv;
-	}
-
-	rv = saClmClusterNodeGet(handle, SA_CLM_LOCAL_NODE_ID, 0, &node);
-
-	gd_nodeid = (int) node.nodeId;
+	struct in_addr addr;
+	int member_list_entries = 0;
 
-	saClmFinalize(handle);
+	evs_membership_get (eh, &addr, NULL, &member_list_entries);
 
-	log_in("member our nodeid %d rv %d", gd_nodeid, rv);
+	gd_nodeid = addr.s_addr;
 
 	return 0;
 }
-------------- next part --------------
? dlm/daemon/dlm_controld
? dlm/make/defines.mk
Index: dlm/daemon/Makefile.openais
===================================================================
RCS file: /cvs/cluster/cluster/dlm/daemon/Makefile.openais,v
retrieving revision 1.1
diff -u -r1.1 Makefile.openais
--- dlm/daemon/Makefile.openais	18 Apr 2005 10:02:26 -0000	1.1
+++ dlm/daemon/Makefile.openais	28 Apr 2005 07:32:35 -0000
@@ -17,7 +17,7 @@
 
 CFLAGS+= -g -I${incdir} -I${top_srcdir}/config
 
-CFLAGS+= -I../../dlm-kernel/src2/ -I../../../openais/include/
+CFLAGS+= -I../../group/daemon -I../../dlm-kernel/src2/ -I../../../openais/include/
 
 
 TARGET=dlm_controld
Index: dlm/daemon/member_openais.c
===================================================================
RCS file: /cvs/cluster/cluster/dlm/daemon/member_openais.c,v
retrieving revision 1.1
diff -u -r1.1 member_openais.c
--- dlm/daemon/member_openais.c	18 Apr 2005 10:02:26 -0000	1.1
+++ dlm/daemon/member_openais.c	28 Apr 2005 07:32:35 -0000
@@ -20,8 +20,6 @@
 #include <arpa/inet.h>
 
 #include "evs.h"
-#include "ais_types.h"
-#include "saClm.h"
 
 #define MAX_NODES	(256)
 
@@ -120,48 +118,16 @@
         return 0;
 }
 
-/* All this SaClm stuff just to get the local nodeid which isn't
-   available through the evs api. */
-
-void foo(SaInvocationT i, const SaClmClusterNodeT *node, SaAisErrorT error)
-{
-}
-
-void bar (const SaClmClusterNotificationBufferT *b, SaUint32T n, SaAisErrorT e)
-{
-}
-
-SaClmCallbacksT clm_callbacks = {
-        .saClmClusterNodeGetCallback = foo,
-        .saClmClusterTrackCallback = bar
-};
-
 int set_our_nodeid(void)
 {
-        SaVersionT version = { 'B', 1, 1 };
-        SaClmHandleT handle;
-        SaClmClusterNodeT node;
-	struct in_addr a;
-        int rv;
+	struct in_addr addr;
+	int member_list_entries = 0;
 
-        rv = saClmInitialize(&handle, &clm_callbacks, &version);
-        if (rv != SA_OK) {
-                log_error("saClmInitialize error %d %d", rv, errno);
-                return rv;
-        }
+	evs_membership_get (eh, &addr, NULL, &member_list_entries);
 
-        rv = saClmClusterNodeGet(handle, SA_CLM_LOCAL_NODE_ID, 0, &node);
-        if (rv != SA_OK) {
-                log_error("saClmClusterNodeGet error %d %d", rv, errno);
-                return rv;
-        }
+	gd_nodeid = addr.s_addr;
 
-        saClmFinalize(handle);
-
-	a.s_addr = node.nodeId;
-	do_set_local((int) node.nodeId, &a);
-
-        return 0;
+	return 0;
 }
 
 static void dummy(struct in_addr source_addr, void *msg, int len)

From manca at link.it  Thu Apr 28 14:42:44 2005
From: manca at link.it (Andrea Manca)
Date: Thu, 28 Apr 2005 16:42:44 +0200
Subject: [Linux-cluster] Mirroring GFS
Message-ID: <1114699363.10554.1437.camel@wall>

Hi to all.

I've a couple of question about interaction between MD devices (software
RAID), GNBD, and GFS.

I'm testing a cluster with four nodes: two gfs and two gnbd nodes.
GNBD1 and GNBD2 export each one a partition over the LAN;
GFS1 and GFS2 import that exported block devices.

I'm interested in mirroring the information in GNBD1 over GNBD2, so i've
thought about creating a md device composed by

/dev/gnbd/GNBD1_PRIMARY
/dev/gnbd/GNBD2_MIRROR

in RAID1 configuration and gfs formatted.
It seem's a reasonable configuration for me, but in:

https://www.redhat.com/archives/linux-cluster/2004-October/msg00055.html

a similar solution (except for a RAID5 configuration), is discussed and
marked as impossible, because Software Raid in linux isn't cluster
aware. This could lead an inconsistency in parity blocks which aren't
locked correctly.
In my opinion two mirrored disks doesn't need a block level locking,
because parity block aren't present.

i've tried this on 4 VMWare machines and it seem work well if you write
(or read) something from GFSx nodes.

Could my solution get working or there is a similar issue applicable in
my situation?
In this second case, are any workarounds possible?


I really appreciate any comment or explanation.

Thank you all in advance.


Andrea.
manca at link.it



From bmarzins at redhat.com  Thu Apr 28 19:17:57 2005
From: bmarzins at redhat.com (Benjamin Marzinski)
Date: Thu, 28 Apr 2005 14:17:57 -0500
Subject: [Linux-cluster] Mirroring GFS
In-Reply-To: <1114699363.10554.1437.camel@wall>
References: <1114699363.10554.1437.camel@wall>
Message-ID: <20050428191757.GH31255@phlogiston.msp.redhat.com>

On Thu, Apr 28, 2005 at 04:42:44PM +0200, Andrea Manca wrote:
> Hi to all.
> 
> I've a couple of question about interaction between MD devices (software
> RAID), GNBD, and GFS.
> 
> I'm testing a cluster with four nodes: two gfs and two gnbd nodes.
> GNBD1 and GNBD2 export each one a partition over the LAN;
> GFS1 and GFS2 import that exported block devices.
> 
> I'm interested in mirroring the information in GNBD1 over GNBD2, so i've
> thought about creating a md device composed by
> 
> /dev/gnbd/GNBD1_PRIMARY
> /dev/gnbd/GNBD2_MIRROR
> 
> in RAID1 configuration and gfs formatted.
> It seem's a reasonable configuration for me, but in:
> 
> https://www.redhat.com/archives/linux-cluster/2004-October/msg00055.html
> 
> a similar solution (except for a RAID5 configuration), is discussed and
> marked as impossible, because Software Raid in linux isn't cluster
> aware. This could lead an inconsistency in parity blocks which aren't
> locked correctly.
> In my opinion two mirrored disks doesn't need a block level locking,
> because parity block aren't present.
> 
> i've tried this on 4 VMWare machines and it seem work well if you write
> (or read) something from GFSx nodes.
> 
> Could my solution get working or there is a similar issue applicable in
> my situation?
> In this second case, are any workarounds possible?

Unfortunately, this won't work.  In your situation, say that GFS1 dies while
in the middle of a write, so that the write completes to GNBD1, but not GNBD2.
The mirror would be out of sync, but there would be no way for GFS2 to know
that.  There is cluster aware mirroring code under development that will fix
this problem.

Sorry
-Ben
 
> I really appreciate any comment or explanation.
> 
> Thank you all in advance.
> 
> 
> Andrea.
> manca at link.it
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> http://www.redhat.com/mailman/listinfo/linux-cluster



From phillips at redhat.com  Thu Apr 28 22:54:03 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 28 Apr 2005 18:54:03 -0400
Subject: [Linux-cluster] Mirroring GFS
In-Reply-To: <1114699363.10554.1437.camel@wall>
References: <1114699363.10554.1437.camel@wall>
Message-ID: <200504281854.03353.phillips@redhat.com>

On Thursday 28 April 2005 10:42, Andrea Manca wrote:
> I've a couple of question about interaction between MD devices
> (software RAID), GNBD, and GFS.
>
> I'm testing a cluster with four nodes: two gfs and two gnbd nodes.
> GNBD1 and GNBD2 export each one a partition over the LAN;
> GFS1 and GFS2 import that exported block devices.
>
> I'm interested in mirroring the information in GNBD1 over GNBD2, so
> i've thought about creating a md device composed by
>
> /dev/gnbd/GNBD1_PRIMARY
> /dev/gnbd/GNBD2_MIRROR
>
> in RAID1 configuration and gfs formatted.
> It seem's a reasonable configuration for me, but in:
>
> https://www.redhat.com/archives/linux-cluster/2004-October/msg00055.h
>tml
>
> a similar solution (except for a RAID5 configuration), is discussed
> and marked as impossible, because Software Raid in linux isn't
> cluster aware.

This is the block driver you are looking for:

   http://sourceware.org/cluster/ddraid/

> This could lead an inconsistency in parity blocks 
> which aren't locked correctly.
> In my opinion two mirrored disks doesn't need a block level locking,
> because parity block aren't present.
>
> i've tried this on 4 VMWare machines and it seem work well if you
> write (or read) something from GFSx nodes.
>
> Could my solution get working or there is a similar issue applicable
> in my situation?
> In this second case, are any workarounds possible?
>
>
> I really appreciate any comment or explanation.

Straight-up md raid5 does read-before-write with no global 
synchronization.  GFS does not know about this and can't know about it.   
Neither does md.  If it works for you, you are just lucky - this time.

Besides that problem, there is the matter of synchronizing resync IO 
across the cluster and interfacing to the cluster membership etc.  
DDRaid addresses all of the above.

But note that ddraid is still pre-alpha, because a few bits aren't fully 
hooked up yet, and some interfaces are still in flux.  I'll update the 
tarball to  (ddraid.0.0.6) later tonight, and post the lca slides.

Regards,

Daniel



From phillips at redhat.com  Thu Apr 28 22:57:30 2005
From: phillips at redhat.com (Daniel Phillips)
Date: Thu, 28 Apr 2005 18:57:30 -0400
Subject: [Linux-cluster] Mirroring GFS
In-Reply-To: <200504281854.03353.phillips@redhat.com>
References: <1114699363.10554.1437.camel@wall>
	<200504281854.03353.phillips@redhat.com>
Message-ID: <200504281857.30154.phillips@redhat.com>

On Thursday 28 April 2005 18:54, I mistyped:
> ...Straight-up md raid5 does read-before-write with no global
> synchronization.  GFS does not know about this and can't know about
> it. Neither does md...

I should have written "neither does the md device that is running on 
some other node".

Regards,

Daniel



From ptr at poczta.fm  Fri Apr 29 11:14:44 2005
From: ptr at poczta.fm (ptr at poczta.fm)
Date: 29 Apr 2005 13:14:44 +0200
Subject: [Linux-cluster] error: LOCK_USE_CLNT undeclared
Message-ID: <20050429111444.C130EEB31A@poczta.interia.pl>

   Hello.

   As I'd like to test supposedly more stable GFS confguration, I downloaded RHEL4 branch of /cluster from the CVS. Unfortunatelly the cluster component won't compile:
I'm getting error on "make" with vanilla kernels 2.6.11, 2.6.10 and older:
/install/GFS/cluster/gfs-kernel/src/nolock/main.c: In function `nolock_plock_get':/install/GFS/cluster/gfs-kernel/src/nolock/main.c:245: error: `LOCK_USE_CLNT' undeclared (first use in this function)

   Any help greatly appreciated.
Regards,

Piotr



----------------------------------------------------------------------
PHP, cgi i MySQL w standardzie >>> http://link.interia.pl/f1878 




From sdake at mvista.com  Thu Apr 28 22:44:52 2005
From: sdake at mvista.com (Steven Dake)
Date: Thu, 28 Apr 2005 15:44:52 -0700
Subject: [Linux-cluster] better integration between openais and redhat
	dlm/group
Message-ID: <1114728291.1497.33.camel@persist.az.mvista.com>

Dave
find attached some patches to remove the requirement to use both the clm
and evs libraries for cluster/dlm and cluster/group openais
integration.  Instead a function is added to the evs library to
determine the local node identifier.  Give it a spin and let me know
what you think...

I also fixed up a few makefile errors where i couldn't compile the dlm
and group from CVS.

regards
-steve
-------------- next part --------------
A non-text attachment was scrubbed...
Name: openais-getmemb.patch
Type: text/x-patch
Size: 16932 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050428/03e6d548/attachment.bin>
-------------- next part --------------
? group/list.h
? group/daemon/groupd
? group/daemon/list2.h
Index: group/daemon/Makefile.openais
===================================================================
RCS file: /cvs/cluster/cluster/group/daemon/Makefile.openais,v
retrieving revision 1.1
diff -u -r1.1 Makefile.openais
--- group/daemon/Makefile.openais	19 Apr 2005 07:00:42 -0000	1.1
+++ group/daemon/Makefile.openais	28 Apr 2005 07:33:21 -0000
@@ -10,7 +10,7 @@
 ###############################################################################
 ###############################################################################
 
-CFLAGS+= -g -I. -I../../../openais/include/
+CFLAGS+= -g -I../include -I../../../openais/include/
 
 TARGET=groupd
 
@@ -23,8 +23,7 @@
 	update.o \
 	done.o \
 	recover.o \
-	../../../openais/lib/libevs.a \
-	../../../openais/lib/libSaClm.a
+	../../../openais/lib/libevs.a
 	$(CC) $(LDFLAGS) -o $@ $^
 
 main.o: main.c gd_internal.h
Index: group/daemon/openais.c
===================================================================
RCS file: /cvs/cluster/cluster/group/daemon/openais.c,v
retrieving revision 1.1
diff -u -r1.1 openais.c
--- group/daemon/openais.c	19 Apr 2005 07:00:42 -0000	1.1
+++ group/daemon/openais.c	28 Apr 2005 07:33:21 -0000
@@ -38,8 +38,6 @@
 
 #include "gd_internal.h"
 #include "evs.h"
-#include "ais_types.h"
-#include "saClm.h"
 
 extern struct list_head gd_nodes;
 extern int              gd_node_count;
@@ -231,42 +229,14 @@
 	return 0;
 }
 
-/* All this SaClm stuff just to get the local nodeid which isn't
-   available through the evs api. */
-
-void foo(SaInvocationT i, const SaClmClusterNodeT *node, SaAisErrorT error)
-{
-}
-
-void bar (const SaClmClusterNotificationBufferT *b, SaUint32T n, SaAisErrorT e)
-{
-}
-
-SaClmCallbacksT clm_callbacks = {
-	.saClmClusterNodeGetCallback = foo,
-	.saClmClusterTrackCallback = bar
-};
-
 int set_our_nodeid(void)
 {
-	SaVersionT version = { 'B', 1, 1 };
-	SaClmHandleT handle;
-	SaClmClusterNodeT node;
-	int rv;
-
-	rv = saClmInitialize(&handle, &clm_callbacks, &version);
-	if (rv != SA_OK) {
-		log_print("saClmInitialize error %d %d", rv, errno);
-		return rv;
-	}
-
-	rv = saClmClusterNodeGet(handle, SA_CLM_LOCAL_NODE_ID, 0, &node);
-
-	gd_nodeid = (int) node.nodeId;
+	struct in_addr addr;
+	int member_list_entries = 0;
 
-	saClmFinalize(handle);
+	evs_membership_get (eh, &addr, NULL, &member_list_entries);
 
-	log_in("member our nodeid %d rv %d", gd_nodeid, rv);
+	gd_nodeid = addr.s_addr;
 
 	return 0;
 }
-------------- next part --------------
? dlm/daemon/dlm_controld
? dlm/make/defines.mk
Index: dlm/daemon/Makefile.openais
===================================================================
RCS file: /cvs/cluster/cluster/dlm/daemon/Makefile.openais,v
retrieving revision 1.1
diff -u -r1.1 Makefile.openais
--- dlm/daemon/Makefile.openais	18 Apr 2005 10:02:26 -0000	1.1
+++ dlm/daemon/Makefile.openais	28 Apr 2005 07:32:35 -0000
@@ -17,7 +17,7 @@
 
 CFLAGS+= -g -I${incdir} -I${top_srcdir}/config
 
-CFLAGS+= -I../../dlm-kernel/src2/ -I../../../openais/include/
+CFLAGS+= -I../../group/daemon -I../../dlm-kernel/src2/ -I../../../openais/include/
 
 
 TARGET=dlm_controld
Index: dlm/daemon/member_openais.c
===================================================================
RCS file: /cvs/cluster/cluster/dlm/daemon/member_openais.c,v
retrieving revision 1.1
diff -u -r1.1 member_openais.c
--- dlm/daemon/member_openais.c	18 Apr 2005 10:02:26 -0000	1.1
+++ dlm/daemon/member_openais.c	28 Apr 2005 07:32:35 -0000
@@ -20,8 +20,6 @@
 #include <arpa/inet.h>
 
 #include "evs.h"
-#include "ais_types.h"
-#include "saClm.h"
 
 #define MAX_NODES	(256)
 
@@ -120,48 +118,16 @@
         return 0;
 }
 
-/* All this SaClm stuff just to get the local nodeid which isn't
-   available through the evs api. */
-
-void foo(SaInvocationT i, const SaClmClusterNodeT *node, SaAisErrorT error)
-{
-}
-
-void bar (const SaClmClusterNotificationBufferT *b, SaUint32T n, SaAisErrorT e)
-{
-}
-
-SaClmCallbacksT clm_callbacks = {
-        .saClmClusterNodeGetCallback = foo,
-        .saClmClusterTrackCallback = bar
-};
-
 int set_our_nodeid(void)
 {
-        SaVersionT version = { 'B', 1, 1 };
-        SaClmHandleT handle;
-        SaClmClusterNodeT node;
-	struct in_addr a;
-        int rv;
+	struct in_addr addr;
+	int member_list_entries = 0;
 
-        rv = saClmInitialize(&handle, &clm_callbacks, &version);
-        if (rv != SA_OK) {
-                log_error("saClmInitialize error %d %d", rv, errno);
-                return rv;
-        }
+	evs_membership_get (eh, &addr, NULL, &member_list_entries);
 
-        rv = saClmClusterNodeGet(handle, SA_CLM_LOCAL_NODE_ID, 0, &node);
-        if (rv != SA_OK) {
-                log_error("saClmClusterNodeGet error %d %d", rv, errno);
-                return rv;
-        }
+	gd_nodeid = addr.s_addr;
 
-        saClmFinalize(handle);
-
-	a.s_addr = node.nodeId;
-	do_set_local((int) node.nodeId, &a);
-
-        return 0;
+	return 0;
 }
 
 static void dummy(struct in_addr source_addr, void *msg, int len)

From lhh at redhat.com  Fri Apr 29 14:37:40 2005
From: lhh at redhat.com (Lon Hohberger)
Date: Fri, 29 Apr 2005 10:37:40 -0400
Subject: [Linux-cluster] Problem with nfsclient.sh status checking
In-Reply-To: <426E246D.90904@uib.no>
References: <426E20D0.6030509@birger.sh>  <426E246D.90904@uib.no>
Message-ID: <1114785460.20618.580.camel@ayanami.boston.redhat.com>

On Tue, 2005-04-26 at 13:22 +0200, Birger Wathne wrote:
> The very quick and dirty fix follows. Edit status check in 
> /usr/share/cluster/nfsclient.sh to look like this
>         exportfs -v | tr -d "\n" | sed -e 's/([^)]*)/\n/g'| grep -q 
> "^${OCF_RESKEY_path}[\t ]*.*${OCF_RESKEY_target}"
> 
> the -v option to exportfs is used to give an easy way to put line breaks 
> in just where they are supposed to be.

Hi Birger,

I filed:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=156369

...for tracking.  I've committed the patch to CVS into the RHEL4 branch
& mainline.  Sorry for the late response.

-- Lon



From rajkum2002 at rediffmail.com  Fri Apr 29 17:04:54 2005
From: rajkum2002 at rediffmail.com (Raj  Kumar)
Date: 29 Apr 2005 17:04:54 -0000
Subject: [Linux-cluster] Out of Memory Problem
Message-ID: <20050429170454.19183.qmail@webmail29.rediffmail.com>

Hello all,

It looks like using "-o acl" to mount GFS filesystem is causing memory leak. Mounting GFS with ACL enabled and defaults set for the mount point size-64 slab is growing as the files are created. I remounted without "-o acl" and size-64 doesnt grow so there is no memory leak. This is reproducible. If you need more information or recommend me to file a bug in bugzilla I would be more than happy to do so.

Thank you,
Raj

On Mon, 18 Apr 2005 Raj  Kumar wrote :
>Hi everyone,
>
>One of our GFS Linux servers has crashed twice yesterday. The log messages indicate the server ran out of memory and started killing processes:
>
>Out of Memory: Killed process 21188 (sshd).
>Out of Memory: Killed process 5215 (xfs).
>
>The server is a HP DL380 with dual Xeon 3.06 GHz processor, 1GB RAM, 2GB swap space running RHEL 3.0- kernel 2.4.21-27.0.1.ELsmp. The server runs NIS, GFS, SSHD and samba services. After the first crash the server didn?t start due to file system corruption. The problem has been corrected and server returned to operation yesterday evening. Today's log indicates the server ran out of memory and killed processes again this morning. This out of memory problem is recurring speciallyl when users are accessing the storage mounted using GFS.
>
>Where can I start to debug the problem?
>
>free -m output:
>
>              total       used       free     shared    buffers     cached
>Mem:          1001        986         14          0          1         79
>-/+ buffers/cache:        905         95
>Swap:         1996         49       1946
>
>I don't understand what's happening to the total 1GB memory. This is the free output that happened seconds before crash (I had swatch set up to log the statistics the moment it sees OOM messages). PS output doesn't show any process taking significant portion of memory either. Since this is happening only when users are using GFS heavily I suspect it is the problem. But how do I verify it? Is 1GB too small for a GFS server?
>
>I found that another user has seen the same problem before:
>https://www.redhat.com/archives/linux-cluster/2005-January/msg00099.html
>
>GFS setup was fine and all our tests passed. We then moved it to production and it immediately failed after running for two days. Your help is very much appreciated!! The problem seems to be reproducible. So if you need any logs I can rerun what our users did at the time of crash.
>
>Thanks,
>Raj
>
>================== Log ============================
>
>Apr  7 10:49:15 server1 kernel: Mem-info:
>Apr  7 10:49:15 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
>Apr  7 10:49:15 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
>Apr  7 10:49:15 server1 kernel: Zone:HighMem freepages:   287 min:   255 low:   510 high:   765
>Apr  7 10:49:15 server1 kernel: Free pages:        3461 (   287 HighMem)
>Apr  7 10:49:15 server1 kernel: ( Active: 22389/6071, inactive_laundry: 889, inactive_clean: 943, free: 3461 )
>Apr  7 10:49:15 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
>Apr  7 10:49:15 server1 kernel:   aa:6 ac:13 id:292 il:43 ic:0 fr:382
>Apr  7 10:49:15 server1 kernel:   aa:15159 ac:7211 id:5769 il:856 ic:943 fr:287
>Apr  7 10:49:15 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:15 server1 kernel: 0*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:15 server1 kernel: 33*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1148kB) Apr  7 10:49:15 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:15 server1 kernel: 218499 pages of slabcache Apr  7 10:49:15 server1 kernel: 216 pages of kernel stacks Apr  7 10:49:16 server1 kernel: 0 lowmem pagetables, 489 highmem pagetables
>Apr  7 10:49:16 server1 kernel: Free swap:       2038872kB
>Apr  7 10:49:16 server1 kernel: 262138 pages of RAM Apr  7 10:49:16 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:16 server1 kernel: 5780 reserved pages Apr  7 10:49:16 server1 kernel: 16752 pages shared Apr  7 10:49:16 server1 kernel: 485 pages swap cached Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
>Apr  7 10:49:16 server1 kernel: Out of Memory: Killed process 21188 (sshd).
>Apr  7 10:49:20 server1 kernel: Mem-info:
>Apr  7 10:49:20 server1 kernel: Zone:DMA freepages:  2792 min:     0 low:     0 high:     0
>Apr  7 10:49:20 server1 kernel: Zone:Normal freepages:   382 min:   766 low:  4031 high:  5791
>Apr  7 10:49:20 server1 kernel: Zone:HighMem freepages:   291 min:   255 low:   510 high:   765
>Apr  7 10:49:20 server1 kernel: Free pages:        3465 (   291 HighMem)
>Apr  7 10:49:20 server1 kernel: ( Active: 21743/6636, inactive_laundry: 896, inactive_clean: 1049, free: 3465 )
>Apr  7 10:49:20 server1 kernel:   aa:0 ac:0 id:0 il:0 ic:0 fr:2792
>Apr  7 10:49:20 server1 kernel:   aa:6 ac:36 id:265 il:40 ic:0 fr:382
>Apr  7 10:49:20 server1 kernel:   aa:14479 ac:7222 id:6365 il:862 ic:1049 fr:291
>Apr  7 10:49:20 server1 kernel: 2*4kB 1*8kB 3*16kB 3*32kB 0*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11168kB) Apr  7 10:49:20 server1 kernel: 28*4kB 1*8kB 2*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1528kB) Apr  7 10:49:20 server1 kernel: 37*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1164kB) Apr  7 10:49:20 server1 kernel: Swap cache: add 1777, delete 1292, find 20425/20536, race 0+0 Apr  7 10:49:21 server1 kernel: 218570 pages of slabcache Apr  7 10:49:21 server1 kernel: 196 pages of kernel stacks Apr  7 10:49:21 server1 kernel: 0 lowmem pagetables, 404 highmem pagetables
>Apr  7 10:49:21 server1 kernel: Free swap:       2038872kB
>Apr  7 10:49:21 server1 kernel: 262138 pages of RAM Apr  7 10:49:21 server1 kernel: 32762 pages of HIGHMEM Apr  7 10:49:21 server1 kernel: 5780 reserved pages Apr  7 10:49:22 server1 kernel: 13904 pages shared Apr  7 10:49:22 server1 kernel: 485 pages swap cached Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
>Apr  7 10:49:22 server1 kernel: Out of Memory: Killed process 5215 (xfs).
>.........
>............
>--
>Linux-cluster mailing list
>Linux-cluster at redhat.com
>http://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20050429/1b72bd8a/attachment.htm>