[Linux-cluster] strange slowness of ls with 1 newly created file on gfs 1 or 2

Christopher Barry Christopher.Barry at qlogic.com
Wed Jul 11 23:44:47 UTC 2007


On Wed, 2007-07-11 at 18:03 -0400, Wendy Cheng wrote:
> Christopher Barry wrote:
> > On Wed, 2007-07-11 at 13:01 -0400, Wendy Cheng wrote:
> >   
> >> Christopher Barry wrote:
> >>     
> >>> On Tue, 2007-07-10 at 22:23 -0400, Wendy Cheng wrote:
> >>>   
> >>>       
> >>>> Pavel Stano wrote:
> >>>>
> >>>>> and then run touch on node 1:
> >>>>> serpico# touch /d/0/test
> >>>>>
> >>>>> and ls on node 2:
> >>>>> dinorscio:~# time ls /d/0/
> >>>>> test
> >>>>>
> >>>> What did you expect from a cluster filesystem? When you touch a file
> >>>> on node 1, it is a "create" that requires at least 2 exclusive locks
> >>>> (the directory lock and the file lock itself, among many other things). On a
> >>>> local filesystem such as ext3, disk activities are delayed by the
> >>>> filesystem cache: "touch" writes the data into cache and "ls" reads
> >>>> it from cache on the very same node - all memory operations. On a cluster
> >>>> filesystem, when you do an "ls" on node 2, node 2 needs to ask node 1 to
> >>>> release the locks (a few ping-pong messages between the two nodes and lock
> >>>> managers via the network), and the contents of node 1's cache need to be
> >>>> synced to the shared storage. After node 2 gets the locks, it has to
> >>>> read the contents from disk.
> >>>>
> >>>> I hope the above explanation is clear.
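
[A minimal sketch of the sequence described above, reusing the /d/0 mount
and hostnames from Pavel's post; the second ls on node 2 should be much
faster, since by then node 2 holds the lock and has the directory cached:]

serpico# touch /d/0/test        # node 1: create takes exclusive locks, data stays in cache
dinorscio# time ls /d/0/        # node 2, first run: lock ping-pong + node 1 flush + disk read
dinorscio# time ls /d/0/        # node 2, repeat: lock and directory contents now cached locally
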
> >>>>
> >>>>> and one last thing: I tried gfs2, but got the same result
> >>>>>
> >>>> -- Wendy
> >>>>     
> >>>>         
> >>> This seems a little bit odd to me. I'm running a RH 7.3 cluster,
> >>> pre-redhat Sistina GFS, lock_gulm, 1GB FC shared disk, and have been
> >>> since ~2002.
> >>>
> >>> Here's the timing I get for the same basic test between two nodes:
> >>>
> >>> [root@sbc1 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root@sbc1 cbarry]# mkdir tst
> >>> [root@sbc1 cbarry]# cd tst
> >>> [root@sbc1 tst]# time touch testfile
> >>>
> >>> real    0m0.094s
> >>> user    0m0.000s
> >>> sys     0m0.000s
> >>> [root@sbc1 tst]# time ls -la testfile
> >>> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
> >>>
> >>> real    0m0.122s
> >>> user    0m0.010s
> >>> sys     0m0.000s
> >>> [root@sbc1 tst]#
> >>>
> >>> Then immediately from the other node:
> >>>
> >>> [root@sbc2 root]# cd /mnt/gfs/workspace/cbarry/
> >>> [root@sbc2 cbarry]# time ls -la tst
> >>> total 12
> >>> drwxr-xr-x    2 root     root         3864 Jul 11 12:20 .
> >>> drwxr-xr-x    4 cbarry   cbarry       3864 Jul 11 12:20 ..
> >>> -rw-r--r--    1 root     root            0 Jul 11 12:20 testfile
> >>>
> >>> real    0m0.088s
> >>> user    0m0.010s
> >>> sys     0m0.000s
> >>> [root@sbc2 cbarry]#
> >>>
> >>>
> >>> Now, you cannot tell me 10 seconds is 'normal' for a clustered fs. That
> >>> just does not fly. My guess is DLM is causing problems.
> >>>
> >> From the previous post we really can't tell, since the network and disk
> >> speeds are variable and unknown. However, look at your data:
> >>
> >> local "ls" is 0.122s
> >> remote "ls" is 0.088s
> >>
> >> I bet the disk flush happened during the first "ls" (and different base
> >> kernels handle dirty-data flushing and IO scheduling differently). I
> >> can't be convinced that DLM is the issue - unless the experiment has
> >> collected enough samples to be statistically significant.
> >>
> >> -- Wendy
> >>
> >>
> 
> ok :) I admire your curiosity. I'm not saying 10 seconds is ok. I'm
> saying a single command doesn't prove anything (since there are so
> many variables there). You need to try a few more runs before
> concluding anything is wrong.
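
[For example, a rough way to collect more than one sample - the /d/0 mount
point and the bench.* names are just placeholders; comparing twenty "real"
times is far more telling than a single run:]

serpico# for i in $(seq 1 20); do time touch /d/0/bench.$i; done      # node 1: 20 creates
dinorscio# for i in $(seq 1 20); do time ls -la /d/0/bench.$i; done   # node 2: 20 remote stats
serpico# rm -f /d/0/bench.*                                           # clean up afterwards
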
> > Where is all the time being spent? Certainly, it should not take 10
> > seconds.
> >
> > Let me see if I get the series of events correct here, and you can
> > correct me where I'm wrong.
> >
> > Node1:
> > touch is run, and asks (indirectly) for 2 exclusive write locks.
> > dlm grants the locks.
> > The file is created in cache.
> > The locks are released (now?)
> >   
> Not necessarily (if there is no other request pending, GFS caches the
> locks, assuming the next request will most likely come from this node).
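
[Roughly, that lock caching can be seen by timing the same local ls before
and after the other node writes into the directory - node1/node2 here stand
for any two nodes mounting the filesystem, and the exact numbers will vary:]

node1# time ls -la /d/0/        # directory lock already cached on node 1: fast
node2# touch /d/0/from_node2    # the exclusive directory lock moves to node 2
node1# time ls -la /d/0/        # node 1 must re-acquire the lock: typically slower
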
> > local ls is run, and asks for read lock
> > dlm grants lock.
> > reads cache.
> > returns results to screen
> > lock is released
> >   
> In your case, the lock was downgraded from write to read and the file was
> flushed, all within the local node, before the remote "ls" was issued. This is
> different from the previous post. The previous poster didn't do a local "ls",
> so he paid the price for the extra network traffic, plus the synchronization
> (wait) cost (waiting for the lock manager to communicate and the file to sync
> to disk). And remember, the lock manager is implemented as a daemon: you send
> the daemon a message, and it may not be woken up in time to receive the
> message. A lot of variables there.
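
[One of those variables, the raw network round trip between the nodes, is
easy to bound on its own; if a ping between the two machines averages well
under a millisecond, a handful of lock-manager messages cannot by themselves
explain seconds of delay:]

dinorscio# ping -c 10 serpico           # round-trip time of the cluster interconnect
dinorscio# ping -c 10 -s 1024 serpico   # same, with a larger payload
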
> > Node2:
> > remote ls is run, and asks for read lock
> > ... what happens here?
> >   
> DLM sends messages (via the network) to node 1 to ask for the lock. After
> the lock is granted, GFS reads the file from disk.
> > I think you're saying dlm looks at the lock request and says "I can't give
> > it to you, because the buffer has not been sync'd to disk yet."
> >   
> No, DLM says "I need to ask whoever is holding the lock to release the
> lock", and GFS waits until the lock is granted. Whoever owns the lock needs
> to act accordingly: if it is an exclusive lock, the file needs
> to be flushed before the lock can be shared.
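
[A crude way to separate the flush cost from the lock cost, again using the
mount from the original post: force the flush on node 1 by hand before node 2
asks for the lock. If the remote ls is then fast, most of the delay was the
sync rather than DLM. Note that a bare sync flushes everything dirty on
node 1, not just this one file:]

serpico# touch /d/0/test2
serpico# sync                   # push node 1's dirty data to the shared storage now
dinorscio# time ls -la /d/0/    # remote ls should now mostly pay the lock round trip
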
> > Does node2 wait and retry asking for the lock after some time period,
> > and do this in a loop? Does the dlm on Node1 request that the data be sync'd
> > so that the requesting Node2 can access the data faster?
> >   
> It does not poll in a loop; it is event-wait-wakeup logic.
> > If Pavel used dd to create a file, rather than touch, with a size larger
> > than the buffer, and then used ls on Node2, would this show far better
> > performance? Is the real issue the corner case of a 0-byte file being
> > created?
> >   
> No, I don't think so. I'm not sure, off the top of my head, how "dd" is
> implemented internally. However, remember that "create" competes with "ls"
> for the directory lock, but a file write itself doesn't compete with "ls"
> since it only requires the file lock. On the other hand, "ls -la" is another
> story - it requires the file size, so it will need the file (inode) locks.
> So there is another variation there.
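
[That difference is easy to probe directly; for instance, with a larger file
created by dd on the shared mount (file name and size are arbitrary), a plain
ls needs only the directory lock, while ls -la also needs each file's inode
to report size and times:]

serpico# dd if=/dev/zero of=/d/0/bigfile bs=1024k count=64   # node 1: create and write a 64 MB file
dinorscio# time ls /d/0/        # directory lock only
dinorscio# time ls -la /d/0/    # also pulls the inode(s), so the file locks come into play
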
> > Basically, I think you're saying that the kernel is keeping the 0 byte
> > touched file in cache, and GFS and/or dlm cannot help with this
> > situation. Is that correct?
> >
> >   
> No, I'm not saying that. Again, I'm saying you need to run the command a
> few times, instead of a one-shot test, before concluding anything, since
> there are simply too many variations and variables underneath these
> simple "touch" and "ls" commands in a cluster environment.
> 
> -- Wendy
> 


Thank you for the lesson, Wendy. ;^)

Another question you'll likely know the answer to: is there a preferred
I/O scheduler to use with GFS?
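
(In case it helps frame the question: on 2.6-based kernels the elevator can
at least be inspected and switched per device at runtime. sdb below is only a
placeholder for whatever block device backs the shared LUN, and the 2.4
kernels in the RH 7.3 setup above do not expose this sysfs interface.)

cat /sys/block/sdb/queue/scheduler               # the scheduler in [brackets] is the active one
echo deadline > /sys/block/sdb/queue/scheduler   # switch this device to the deadline elevator
# or set a default for all devices at boot with the kernel parameter:  elevator=deadline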


-- 
Regards,
-C



