[Linux-cluster] strange slowness of ls with 1 newly created file on gfs 1 or 2
Christopher Barry
Christopher.Barry at qlogic.com
Wed Jul 11 23:44:47 UTC 2007
On Wed, 2007-07-11 at 18:03 -0400, Wendy Cheng wrote:
> Christopher Barry wrote:
> > On Wed, 2007-07-11 at 13:01 -0400, Wendy Cheng wrote:
> >
> >> Christopher Barry wrote:
> >>
> >>> On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
> >>>
> >>>
> >>>> Pavel Stano wrote:
> >>>>
> >>>>
> >>>>
> >>>>> and then run touch on node 1:
> >>>>> serpico# touch /d/0/test
> >>>>>
> >>>>> and ls on node 2:
> >>>>> dinorscio:~# time ls /d/0/
> >>>>> test
> >>>>>
> >>>> What did you expect from a cluster filesystem? When you touch a file
> >>>> on node 1, it is a "create" that requires at least 2 exclusive locks
> >>>> (the directory lock and the file lock itself, among many other things).
> >>>> On a local filesystem such as ext3, disk activity is delayed by the
> >>>> filesystem cache: "touch" writes the data into cache and "ls" reads
> >>>> it from cache on the very same node - all memory operations. On a
> >>>> cluster filesystem, when you do an "ls" on node 2, node 2 needs to ask
> >>>> node 1 to release the locks (a few ping-pong messages between the two
> >>>> nodes and the lock managers via the network), and the contents of node
> >>>> 1's cache need to be synced to the shared storage. After node 2 gets
> >>>> the locks, it has to read the contents from disk.
> >>>>
> >>>> I hope the above explanation is clear.
> >>>>
> >>>>> and last thing, i try gfs2, but same result
> >>>>>
> >>>> -- Wendy
> >>>>
> >>>>
> >>> This seems a little odd to me. I've been running a RH 7.3 cluster with
> >>> pre-Red Hat Sistina GFS, lock_gulm, and a 1GB FC shared disk since
> >>> ~2002.
> >>>
> >>> Here's the timing I get for the same basic test between two nodes:
> >>>
> >>> [root at sbc1 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root at sbc1 cbarry]# mkdir tst
> >>> [root at sbc1 cbarry]# cd tst
> >>> [root at sbc1 tst]# time touch testfile
> >>>
> >>> real 0m0.094s
> >>> user 0m0.000s
> >>> sys 0m0.000s
> >>> [root at sbc1 tst]# time ls -la testfile
> >>> -rw-r--r-- 1 root root 0 Jul 11 12:20 testfile
> >>>
> >>> real 0m0.122s
> >>> user 0m0.010s
> >>> sys 0m0.000s
> >>> [root at sbc1 tst]#
> >>>
> >>> Then immediately from the other node:
> >>>
> >>> [root at sbc2 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root at sbc2 cbarry]# time ls -la tst
> >>> total 12
> >>> drwxr-xr-x 2 root root 3864 Jul 11 12:20 .
> >>> drwxr-xr-x 4 cbarry cbarry 3864 Jul 11 12:20 ..
> >>> -rw-r--r-- 1 root root 0 Jul 11 12:20 testfile
> >>>
> >>> real 0m0.088s
> >>> user 0m0.010s
> >>> sys 0m0.000s
> >>> [root at sbc2 cbarry]#
> >>>
> >>>
> >>> Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
> >>> just does not fly. My guess is DLM is causing problems.
> >>>
> >>>
> >>>
> >> From the previous post, we really can't tell, since the network and disk
> >> speeds are variable and unknown. However, look at your data:
> >>
> >> local "ls" is 0.122s
> >> remote "ls" is 0.088s
> >>
> >> I bet the disk flushing happened during the first "ls" (and different
> >> base kernels treat their dirty-data flushing and IO scheduling
> >> differently). I can't be convinced that DLM is an issue - unless the
> >> experiment has collected enough samples to be statistically significant.
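Point taken about sample size. A quick, unscientific sketch for collecting more than one data point - `DIR` and `N` are made-up knobs, and it defaults to a local temp directory (point `DIR` at the GFS mount to exercise the cluster path instead):

```shell
#!/bin/sh
# Sketch: time N touch+ls cycles so a single slow run does not dominate
# the measurement. DIR and N are illustrative, not GFS parameters.
DIR=${DIR:-$(mktemp -d)}
N=${N:-20}
i=0
start=$(date +%s)
while [ "$i" -lt "$N" ]; do
    touch "$DIR/test$i"          # create: directory + file locks
    ls -la "$DIR" > /dev/null    # list with sizes: inode locks
    i=$((i + 1))
done
end=$(date +%s)
echo "$N touch+ls cycles in $((end - start))s"
```

Running this on each node in turn (against the shared mount) would at least show whether the 10 seconds is repeatable or a one-off.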
> >>
> >> -- Wendy
> >>
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >>
> >
> >
>
> ok :) I admire your curiosity. I'm not saying 10 seconds is ok. I'm
> saying a single command doesn't imply anything (since there are so
> many variables involved). You need to try a few more runs before
> concluding anything is wrong.
> > Where is all the time being spent? Certainly, it should not take 10
> > seconds.
> >
> > Let me see if I get the series of events correct here, and you can
> > correct me where I'm wrong.
> >
> > Node1:
> > touch is run, and asks (indirectly) for 2 exclusive write locks.
> > dlm grants the locks.
> > File is created into cache.
> > locks are released (now?)
> >
> Not necessarily (if there is no other request pending, GFS caches the
> locks, assuming the next request will most likely come from this node).
> > local ls is run, and asks for read lock
> > dlm grants lock.
> > reads cache.
> > returns results to screen
> > lock is released
> >
> In your case, the lock was downgraded from write to read and the file was
> flushed, all within the local node, before the remote "ls" was issued.
> This is different from the previous post. The previous poster didn't do a
> local "ls", so he paid the price for the extra network traffic, plus the
> synchronization (wait) cost (waiting for the lock manager to communicate
> and for the file to sync to disk). And remember the lock manager is
> implemented as a daemon: you send the daemon a message, and it may not be
> woken up in time to receive it. A lot of variables there.
> > Node2:
> > remote ls is run, and asks for read lock
> > ... what happens here?
> >
> DLM sends messages (via the network) to node 1 to ask for the lock. After
> the lock is granted, GFS reads the file from the disk.
> > I think you're saying dlm looks at the lock request, and says I can't give
> > it to you, because the buffer has not been sync'd to disk yet.
> >
> No, DLM says: I need to ask whoever is holding the lock to release it. And
> GFS waits until the lock is granted. Whoever owns the lock needs to act
> accordingly: if it is an exclusive lock, the file needs to be flushed
> before the lock can be shared.
> > Does node2 wait, and retry asking for the lock after some time period,
> > and do this in a loop? Does the dlm on Node1 request that the data be
> > sync'd so that the requesting Node2 can access the data faster?
> >
> It is not a loop. It is event-wait-wakeup logic.
> > If Pavel used dd to create a file, rather than touch, with a size larger
> > than the buffer, and then used ls on Node2, would this show far better
> > performance? Is the real issue the corner-case of a 0 byte file being
> > created?
> >
> No, I don't think so. I'm not sure how "dd" is implemented internally off
> the top of my head. However, remember that "create" competes with "ls" for
> the directory lock, but a file write itself doesn't compete with "ls",
> since it only requires the file lock. On the other hand, "ls -la" is
> another story - it requires the file size, so it will need the file
> (inode) locks. So there is another variation there.
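To poke at that create-vs-write distinction, here is a hedged sketch (the scratch directory and file names are made up; in practice the interesting numbers come from running the `ls` side on the *other* node against the GFS mount):

```shell
#!/bin/sh
# Create an empty file (directory + file locks only) and a ~10MB file
# (create plus data writes), then list with sizes, which also needs the
# inode locks as described above.
DIR=${DIR:-$(mktemp -d)}
touch "$DIR/empty"
dd if=/dev/zero of="$DIR/tenmeg" bs=1024 count=10240 2>/dev/null
ls -la "$DIR" > /dev/null
```

Wrapping each command in `time` on the two nodes would show whether the larger file changes the cross-node "ls" cost at all.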
> > Basically, I think you're saying that the kernel is keeping the 0 byte
> > touched file in cache, and GFS and/or dlm cannot help with this
> > situation. Is that correct?
> >
> >
> No, I'm not saying that. Again, I'm saying you need to run the command a
> few times, instead of a one-time shot, before concluding anything, since
> there are simply too many variations and variables underneath these
> simple "touch" and "ls" commands in a cluster environment.
>
> -- Wendy
>
Thank you for the lesson, Wendy. ;^)
Another question you'll likely know the answer to: is there a preferred
IO scheduler to use with GFS?
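In case it helps frame the question: on 2.6 kernels the active scheduler is visible (and switchable) per block device through sysfs. A sketch, guarded in case sysfs isn't present:

```shell
#!/bin/sh
# Print the IO scheduler ("elevator") per block device; the name shown
# in brackets is the active one. Guarded so it degrades gracefully on
# systems without sysfs.
found=0
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    found=1
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
[ "$found" -eq 1 ] || echo "no scheduler files found (no sysfs?)"
```

Writing one of the listed names back to the file (e.g. `echo deadline > /sys/block/sda/queue/scheduler`, with `sda` as a placeholder device) switches it at runtime.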
--
Regards,
-C