[Linux-cluster] PVFS going Wild

dave first linux4dave at gmail.com
Mon Oct 3 21:49:10 UTC 2005


Hey Guys,

I just took over a couple of clusters from a sysadmin who left the company.
Unfortunately, the hand-off was less than informative. <sigh> So, I've got
an old Linux cluster, still well-used, with a PVFS filesystem mounted at
/work. I'm new to clustering, and I sure as hell don't know much about it,
but I've got a sick puppy here. Everything points to the PVFS filesystem.

lsof: WARNING: can't stat() pvfs file system /work
Output information may be incomplete.


In /var/log/messages:

Oct 3 13:51:34 elvis PAM_pwdb[24431]: (su) session opened for user deb_r by deb(uid=2626)
Oct 3 13:51:49 elvis kernel: (./ll_pvfs.c, 361): ll_pvfs_getmeta failed on downcall for 192.168.1.102:3000/pvfs-meta
Oct 3 13:51:49 elvis kernel: (./ll_pvfs.c, 361): ll_pvfs_getmeta failed on downcall for 192.168.1.102:3000/pvfs-meta/manaa/DFTBNEW
Oct 3 14:16:48 elvis kernel: (./ll_pvfs.c, 409): ll_pvfs_statfs failed on downcall for 192.168.1.102:3000/pvfs-meta
Oct 3 14:16 elvis kernel: (./inode.c, 321): pvfs_statfs failed
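
The metadata server in those messages is 192.168.1.102, and port 3000 is
(I gather) the PVFS manager's port. So the first sanity check I have in
mind is whether anything is actually answering there; something like the
following, assuming the PVFS1 manager daemon is just called mgr (that's my
reading of the docs, not something I've verified on this box):

# can we even reach the metadata server and its port?
ping -c 3 192.168.1.102
telnet 192.168.1.102 3000

# and on whichever machine 192.168.1.102 actually is, is the manager up?
ps aux | grep mgr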

So, the system in question:

Linux elvis 2.2.19-13.beosmp #1 SMP Tue Aug 21 20:04:44 EDT 2001 i686 unknown

Red Hat Linux release 6.2 (Zoot)

I can't access /work from the master or any of the nodes:

elvis [49#] ls /work
ls: /work: Too many open files
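
The "Too many open files" bit makes me wonder whether the kernel's file
table on the master is exhausted, rather than anything per-process, so
checking these seems worthwhile (on a 2.2 kernel they should be under
/proc/sys/fs):

# current file handle usage vs. the system-wide limit
cat /proc/sys/fs/file-nr
cat /proc/sys/fs/file-max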


I ran a script in /usr/bin called pvfs_client_stop.sh, which unmounts /work
everywhere, kills the pvfsd client daemons, and removes the pvfs module:

#!/bin/tcsh

# Phil Carns
# pcarns at hubcap.clemson.edu
#
# This is an example script for how to get Scyld Beowulf cluster nodes
# to mount a PVFS file system.

set PVFSD = "/usr/sbin/pvfsd"
set PVFSMOD = "pvfs"
set PVFS_CLIENT_MOUNT_DIR = "/work"
set MOUNT_PVFS = "/sbin/mount.pvfs"

# unmount the file system locally and on all of the slave nodes
/bin/umount $PVFS_CLIENT_MOUNT_DIR
bpsh -pad /bin/umount $PVFS_CLIENT_MOUNT_DIR

# kill all of the pvfsd client daemons
/usr/bin/killall pvfsd

# remove the pvfs module on the local and the slave nodes
/sbin/rmmod $PVFSMOD
bpsh -pad /sbin/rmmod $PVFSMOD
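
Something like this ought to confirm the teardown actually took everywhere;
I'm just reusing the same bpsh -pad incantation the script uses for the
slave nodes:

# is the pvfs module really gone, locally and on the slaves?
/sbin/lsmod | grep pvfs
bpsh -pad /sbin/lsmod | grep pvfs

# any pvfsd client daemons still hanging around?
/bin/ps aux | grep pvfsd
bpsh -pad /bin/ps aux | grep pvfsd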

Then I ran pvfs_client_start.sh /work, which seemed to work, except it never
exited...

#!/bin/tcsh

# Phil Carns
# pcarns at hubcap.clemson.edu
#
# This is an example script for how to get Scyld Beowulf cluster nodes
# to mount a PVFS file system.

set PVFSD = "/usr/sbin/pvfsd"
set PVFSMOD = "pvfs"
set PVFS_CLIENT_MOUNT_DIR = "/work"
set MOUNT_PVFS = "/sbin/mount.pvfs"
set PVFS_META_DIR = `bpctl -M -a`:$1

if ("$1" == "") then
 echo "usage: pvfs_client_start.sh <meta dir>"
 echo "(Causes every machine in the cluster to mount the PVFS file system)"
 exit -1
endif

# insert the pvfs module on the local and slave nodes
/sbin/modprobe $PVFSMOD
bpsh -pad /sbin/modprobe $PVFSMOD

# start the pvfsd client daemon on the local and slave nodes
$PVFSD
bpsh -pad $PVFSD

# actually mount the file system locally and on all of the slave nodes
$MOUNT_PVFS $PVFS_META_DIR $PVFS_CLIENT_MOUNT_DIR
bpsh -pad $MOUNT_PVFS $PVFS_META_DIR $PVFS_CLIENT_MOUNT_DIR
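
If I'm reading the script right, calling it as pvfs_client_start.sh /work
means $1 was /work, so those last two lines would have expanded to roughly
the following (bpctl -M -a apparently supplies the master's cluster
address; the 192.168.1.102 here is just my guess from the log messages):

/sbin/mount.pvfs 192.168.1.102:/work /work
bpsh -pad /sbin/mount.pvfs 192.168.1.102:/work /work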


Running the script seemed to work (well, it restarted the daemons and
such), but I still can't get into /work, and the mount step keeps giving
me this:

mount.pvfs: Device or resource busy
mount.pvfs: server 192.168.1.102 alive, but mount failed (invalid metadata
directory name?)
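
Before anything else, I figure I should check whether /work is still
half-mounted somewhere (which would explain the "busy"), and whether
/pvfs-meta, the path from the log messages above, actually looks like a
PVFS metadata directory. My understanding is a PVFS1 meta dir should hold
the .iodtab and .pvfsdir files, but correct me if that's wrong:

# any stale pvfs mounts, locally or on the slaves?
grep pvfs /etc/mtab /proc/mounts
bpsh -pad /bin/grep pvfs /proc/mounts

# and on whichever box holds the metadata (192.168.1.102?), does it look sane?
ls -la /pvfs-meta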

Comments? Useful ideas? A good joke???

dave