[Linux-cluster] R: nfs cluster, problem with delete file in the failover case

Cao, Vinh vinh.cao at hp.com
Wed May 13 11:38:51 UTC 2015


Sounds like the process that had the file open while you were moving it to another node still holds it open. Meaning you are deleting the file and doing the failover at the same time.
This has nothing to do with your cluster setup.

I believe you can run the lsof command on the system where you see the disk space is still not cleaned up, then grep for the "deleted" flag.
You may see the process number that is still there. Kill that process and it will clean up the file handle that is still open.

That is how I see your problem. I don't think it has anything to do with the OS cluster.
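The effect Vinh describes can be demonstrated locally: on Linux, unlinking a file that some process still holds open removes the directory entry but not the blocks, and the open-but-deleted handle is visible under /proc (this is also what `lsof | grep deleted` finds). A minimal sketch, using a throwaway temp file and `tail -f` as a stand-in for the process that keeps the file open:

```shell
# Create a file, hold it open in a background reader, then delete it.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/big" bs=1M count=10 2>/dev/null
tail -f "$tmpdir/big" >/dev/null &
holder=$!
sleep 1                      # give tail time to open the file
rm "$tmpdir/big"             # directory entry gone, blocks still allocated
# The holder's fd table still shows the "(deleted)" entry; this is the
# same thing "lsof | grep deleted" would report on the server:
deleted=$(ls -l /proc/$holder/fd 2>/dev/null | grep -c deleted)
echo "open-but-deleted fds: $deleted"
# Killing the holder releases the handle and lets the space be freed:
kill $holder
rm -rf "$tmpdir"
```

Until the holder exits, `df` keeps counting those blocks as used even though `du` no longer sees the file, which matches the symptom below.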

Vinh

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of sella gianpietro
Sent: Wednesday, May 13, 2015 7:06 AM
To: 'linux clustering'
Subject: [Linux-cluster] R: nfs cluster, problem with delete file in the failover case

this is the inode count in the exported folder of the volume on the server, before writing the file from the client:

[root at cld-blu-13 nova]# du --inodes
2       .

these are the used blocks:

[root at cld-blu-13 nova]# df -T
Filesystem                            Type      1K-blocks    Used  Available Use% Mounted on
/dev/mapper/nfsclustervg-nfsclusterlv xfs      1152878588   33000 1152845588   1% /nfscluster

after writing the file from the client, with an umount/mount on the server during the write:

[root at cld-blu-13 nova]# du --inodes
3       .

[root at cld-blu-13 nova]# df -T
Filesystem                            Type      1K-blocks     Used
Available Use% Mounted on
/dev/mapper/nfsclustervg-nfsclusterlv xfs      1152878588 21004520
1131874068   2% /nfscluster

this is correct.
now delete the file:

[root at cld-blu-13 nova]# du --inodes
2       .

the number of inodes is correct (from 3 back to 2).

[root at cld-blu-13 nova]# df -T
Filesystem                            Type      1K-blocks     Used
Available Use% Mounted on
/dev/mapper/nfsclustervg-nfsclusterlv xfs      1152878588 21004520
1131874068   2% /nfscluster

the number of used blocks is not correct.
It does not return to the initial value of 33000.






-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of J. Bruce Fields
Sent: Tuesday, May 12, 2015 17:25
To: linux clustering
Subject: Re: [Linux-cluster] nfs cluster, problem with delete file in the failover case

On Tue, May 12, 2015 at 12:37:10AM +0200, gianpietro.sella at unipd.it wrote:
> > On Sun, May 10, 2015 at 11:28:25AM +0200, gianpietro.sella at unipd.it
wrote:
> >> Hi, sorry for my bad english.
> >> I am testing an nfs cluster, active/passive (2 nodes).
> >> I used the following instructions for nfs:
> >>
> >>
> >> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Administration/s1-resourcegroupcreatenfs-HAAA.html
> >>
> >> I use centos 7.1 on the nodes.
> >> The 2 node of the cluster share the same iscsi volume.
> >> The nfs cluster works very well.
> >> I have only one problem.
> >> I mount the nfs cluster exported folder on my client node (nfsv3
> >> protocol).
> >> I write a big data file (70GB) to the nfs folder:
> >> dd if=/dev/zero bs=1M count=70000 > /Instances/output.dat
> >> Before the write finished I put the active node in standby status,
> >> then the resource migrated to the other node.
> >> When the dd write finished the file was ok.
> >> I deleted the file output.dat.
> >
> > So, the dd and the later rm are both run on the client, and the rm 
> > after the dd has completed and exited?  And the rm doesn't happen 
> > till after the first migration is completely finished?  What version 
> > of NFS are you using?
> >
> > It sounds like a sillyrename problem, but I don't see the explanation.
> >
> > --b.
> 
> 
> Hi Bruce, thanks for your answer.
> yes, the dd command and the rm command (both on the client node) finish 
> without error.
> I use nfsv3, but it is the same with the nfsv4 protocol.
> the OS is centos 7.1, the nfs package is nfs-utils-1.3.0-0.8.el7.x86_64.
> the pacemaker configuration is:
> 
> pcs resource create nfsclusterlv LVM volgrpname=nfsclustervg exclusive=true --group nfsclusterha
> 
> pcs resource create nfsclusterdata Filesystem device="/dev/nfsclustervg/nfsclusterlv" directory="/nfscluster" fstype="ext4" --group nfsclusterha
> 
> pcs resource create nfsclusterserver nfsserver nfs_shared_infodir=/nfscluster/nfsinfo nfs_no_notify=true --group nfsclusterha
> 
> pcs resource create nfsclusterroot exportfs clientspec=192.168.61.0/255.255.255.0 options=rw,sync,no_root_squash directory=/nfscluster/exports fsid=0 --group nfsclusterha
> 
> pcs resource create nfsclusternova exportfs clientspec=192.168.61.0/255.255.255.0 options=rw,sync,no_root_squash directory=/nfscluster/exports/nova fsid=1 --group nfsclusterha
> 
> pcs resource create nfsclusterglance exportfs clientspec=192.168.61.0/255.255.255.0 options=rw,sync,no_root_squash directory=/nfscluster/exports/glance fsid=2 --group nfsclusterha
> 
> pcs resource create nfsclustervip IPaddr2 ip=192.168.61.180 cidr_netmask=24 --group nfsclusterha
> 
> pcs resource create nfsclusternotify nfsnotify source_host=192.168.61.180 --group nfsclusterha
> 
> now I have done the next test.
> nfs cluster with 2 nodes:
> the first node in standby state,
> the second node in active state.
> I mount the empty (no used space) exported volume on the client with the
> nfsv3 protocol (with the nfsv4 protocol it is the same).
> I write a big file (70GB) from the client into the mount directory with
> dd (but it is the same with the cp command).
> while the command is writing the file I disable the nfsnotify, IPaddr2,
> exportfs and nfsserver resources in this order (pcs resource disable ...)
> and then I enable the resources (pcs resource enable ...) in the reverse order.
> when I disable the resources the writing freezes; when I enable them the
> writing restarts without error.
> when the writing command has finished I delete the file.
> the mount directory is empty and the used space of the exported volume is 
> 0, this is ok.
> now I repeat the test,
> but now I disable/enable even the Filesystem resource:
> disable the nfsnotify, IPaddr2, exportfs, nfsserver and Filesystem
> resources (writing freezes), then enable them in the reverse order
> (writing restarts without error).
> when the writing command has finished I delete the file.
> now the mounted directory is empty (no file) but the used space is not 
> 0, it is 70GB.
> this is not ok.
> now I execute the next command on the active node of the cluster where 
> the volume is exported with nfs:
> mount -o remount /dev/nfsclustervg/nfsclusterlv
> where /dev/nfsclustervg/nfsclusterlv is the exported volume (iscsi volume 
> configured with lvm).
> after this command the used space in the mounted directory of the client
> is 0, this is ok.
> I think that the problem is the Filesystem resource on the active node 
> of the cluster.
> but it is very strange.

So, the only difference between the "good" and "bad" cases was the addition of the stop/start of the filesystem resource?  I assume that's equivalent to an umount/mount.

I guess the server's dentry for that file is hanging around for a little while for some reason.  We've run across at least one problem of that sort before (see d891eedbc3b1 "fs/dcache: allow d_obtain_alias() to return unhashed dentries").

In both cases after the restart the first operation the server will get for that file is a write with a filehandle, and it will have to look up that filehandle to find the file.  (Whereas without the restart the initial discovery of the file will be a lookup by name.)

In the "good" case the server already has a dentry cached for that file, in the "bad" case the umount/mount means that we'll be doing a cold-cache lookup of that filehandle.

I wonder if the test case can be narrowed down any further....  Is the large file necessary?  If it's needed only to ensure the writes are actually sent to the server promptly then it might be enough to do the nfs mount with -osync.
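Bruce's `-osync` idea amounts to mounting the export with synchronous writes on the client, so each write reaches the server promptly and the 70GB file may no longer be needed. A hypothetical fstab entry for the client, reusing the cluster VIP and the nova export path from the configuration quoted above:

```
192.168.61.180:/nfscluster/exports/nova  /Instances  nfs  vers=3,sync  0 0
```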

Instead of the cluster migration or restart, it might be possible to reproduce the bug just with a

	echo 2 >/proc/sys/vm/drop_caches

run on the server side while the dd is in progress--I don't know if that will reliably drop the one dentry, though.  Maybe do a few of those in a row.
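The "few of those in a row" could be scripted, for example as below. This is only a sketch: it would be run as root on the server while the client's dd is in flight, and there is no guarantee any given pass evicts the one dentry in question.

```shell
# Drop dentries and inodes a few times in a row on the NFS server
# (writing to drop_caches requires root; non-root runs are skipped).
drops=0
for i in 1 2 3; do
    sync                                            # flush dirty data first
    if (echo 2 > /proc/sys/vm/drop_caches) 2>/dev/null; then
        drops=$((drops + 1))                        # 2 = reclaim dentries and inodes
    fi
    sleep 1
done
echo "drop_caches runs: $drops"
```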

--b.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster





