[Linux-cluster] stuck processes on GFS partition?

Matt Brookover mbrookov at mines.edu
Thu Dec 15 21:54:14 UTC 2005


I was the first to get a process stuck in a device wait.  I created a
directory in the root of the file system and then tried to do an ls. 
The ls got stuck.  From the looks of the logs, the problems had started
the day before, but went unnoticed until I did an ls.  The new directory
worked from other nodes that had mounted that GFS file system.  

Unfortunately, I do not believe that the server was doing much of
anything at the time.  There were a few users, mostly reading email, and
not using the file system that had the problem.  The partition in
question is used for mail lists and as a dumping ground for backups for 6
other servers.  The backups were not running at the time the first
gfs_releasepage() message was logged. The mail lists are just test lists
and not in use yet. The backups transfer about 10GB of data in 12 to 15
files between 3am and 5am every day.  The backups are transferred by scp
(the only path through a firewall). The backups that night ran without
any problems, both the copy from the remote servers and a copy of that
file system to tape.

If/when it happens again, I will try to have a better idea of what was
going on at the time.
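
One way to record which processes are in a device wait when it recurs is
to read the state field from /proc.  The following is only a minimal
sketch, assuming a Linux /proc layout in which /proc/<pid>/stat holds the
pid, the command name in parentheses, and then a one-letter state ("D"
meaning uninterruptible sleep).  The function names parse_stat and
device_wait_processes are made up for illustration and are not part of
GFS or any cluster tool.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat record into (pid, comm, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST closing parenthesis.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def device_wait_processes(proc_root="/proc"):
    """Return (pid, comm) for every process whose state is 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the record was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in device_wait_processes():
        print(pid, comm)
```

Run from cron every minute or so with the output appended to a file, this
would give a rough timeline of when the first process got stuck, which
could then be compared against the gfs_releasepage() messages in the logs.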

The server in question had been up for over 30 days when the problem
started.

Thank you

Matt

On Thu, 2005-12-15 at 14:24, Andrew C. Dingman wrote:

> On Mon, 2005-12-12 at 14:51 -0700, Matt Brookover wrote:
> > This looks like a similar problem to the one described in bugzilla
> > 160409.  It does not look like there ever was a solution. 
> > 
> 
> It does look similar, and there was no solution. We were never able to
> reproduce the problem by any method other than putting it into
> production. I think the theory we ended up with was that there was some
> sort of lock contention problem, possibly having to do with the network
> here. It was just a theory, though. We never managed to prove anything.
> 
> Do you know what you did to trigger it? I assume something other than
> 300 people running jBase applications?