[Linux-cluster] mixing OS versions?

Sat Apr 26 00:06:51 UTC 2014

Hi,

On Friday 25 of April 2014 12:42:59 Steven Whitehouse wrote:
> Hi,
> 
> On 24/04/14 17:29, Alan Brown wrote:
> > On 30/03/14 12:34, Steven Whitehouse wrote:
> >> Well that is not entirely true. We have done a great deal of
> >> investigation into this issue. We do test quotas (among many other
> >> things) on each release to ensure that they are working. Our tests have
> >> all passed correctly, and to date you have provided the only report of
> >> this particular issue via our support team. So it is certainly not
> >> something that lots of people are hitting.
> > 
> > Someone else reported it on this list (on centos), so we're not an
> > isolated case.
> > 
> >> We do now have a good idea of where the issue is. However it is clear
> >> that simply exceeding quotas is not enough to trigger it. Instead quotas
> >> need to be exceeded in a particular way.
> > 
> > My suspicion is that it's some kind of interaction between quotas and
> > NFS, but it'd be good if you could provide a fuller explanation.
> 
> Yes, that's what we thought to start with... however that turned out to
> be a bit of a red herring. Or at least the issue has nothing
> specifically to do with NFS. The problem was related to when quota was
> exceeded, and specifically what operation was in progress. You could
> write to files as often as you wanted to, and exceeding quota would be
> handled correctly. The problem was a specific code path within the inode
> creation code. If quota was not exceeded on that one specific code path,
> then everything would work as expected.

Could you please provide a (somewhat reliable) test case to reproduce this
bug? I have looked at the patch and found nothing obviously related to quotas
(it seems the patch only changes the failure path of the posix_acl_create()
call, which doesn't appear to have anything to do with quotas).
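
To make the request concrete, here is a rough, untested sketch of the kind
of test I have in mind. It assumes the trigger is exceeding quota while
creating an inode (as described above) and that it is run by a quota-limited
user inside a directory on a GFS2 mount. The function name, the file prefix
and the limits are made up for illustration, and I have no confirmation that
this actually hits the bug.

```python
import errno
import os


def create_until_quota(directory, max_files=100000, prefix="quota-test-"):
    """Create empty files in `directory` until the filesystem refuses.

    Returns (files_created, errno_or_None). The errno is EDQUOT or ENOSPC
    if the filesystem stopped us, or None if max_files was reached first.
    Hypothetical helper: not taken from the GFS2 test suite.
    """
    created = 0
    for i in range(max_files):
        path = os.path.join(directory, f"{prefix}{i:08d}")
        try:
            # O_EXCL forces a real create, so each iteration allocates
            # a new inode instead of reopening an existing file.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except OSError as exc:
            if exc.errno in (errno.EDQUOT, errno.ENOSPC):
                return created, exc.errno
            raise  # anything else is unexpected, so let it surface
        os.close(fd)
        created += 1
    return created, None
```

The idea is that only file creation, never a data write, pushes the user
over quota, so the limit is hit in the create path rather than the write path.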

I have been facing a possibly quota-related oops in GFS2 for some time, which
I am unable to reproduce without switching my cluster to production use (which
means potentially facing the anger of my users, which I'd rather not do without
at least a chance of the issue being fixed).

Sadly, I don't have a Red Hat support subscription (nor do I use RHEL or
derivatives); my kernel is mostly upstream.

thanks
Pavel Herrmann

> 
> Also, quite often when the problem did appear, it did not actually
> trigger a problem until later, making it difficult to track down.
> 
> You are correct that someone else reported the issue on the list,
> however I'm not aware of any other reports beyond yours and theirs.
> Also, this was specific to certain versions of GFS2, and not something
> that relates to all versions.
> 
> The upstream patch is here:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/gfs2?id=059788039f1e6343f34f46d202f8d9f2158c2783
> 
> It should be available in RHEL shortly - please ping support via the
> ticket for updates,
> 
> Steve.
> 
> >> Returning to the original point however, it is certainly not recommended
> >> to have mixed RHEL or CentOS versions running in the same cluster. It is
> >> much better to keep everything the same, even though the GFS2 on-disk
> >> format has not changed between the versions.
> > 
> > > More specifically (for those who are curious): Whilst the on-disk
> > format has not changed between EL5 and EL6, the way that RH cluster
> > members communicate with each other has.
> > 
> > I ran a quick test some time back and the 2 different OS cluster
> > versions didn't see each other for LAN heartbeating.