[Linux-cluster] GFS2 DLM problem on NVMes

Mon Nov 20 10:40:54 UTC 2017

Hi,

On 20/11/17 04:23, 성백재 wrote:
>
> Hello, List.
>
> We are developing storage systems using 10 NVMes (current test set).
>
> Using MD RAID10 + CLVM/GFS2 over four hosts achieves up to 22 GB/s on
> reads.
>
> However, a GFS2 DLM problem occurred. The problem is that each host 
> frequently reports “dlm: gfs2: send_repeat_remove” kernel messages, 
> and I/O throughput becomes unstable and low.
>
> I found a GFS2 commit message about the “send_repeat_remove” function.
>
> (https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?id=96006ea6d4eea73466e90ef353bf34e507724e77)
>
> Information about the test environment:
>
> Four hosts share 10 NVMes, and each host deploys CLVM/GFS2 on top of 
> the cluster MD RAID1 + MD RAID0.
>
> GFS2 has 2,000 directories, each with 1,900 media files (3 MB on average).
>
> Each host runs 20 threads of NGINX, and each thread randomly reads 
> media files on demand.
>
> The Linux kernel version is 4.11.8.
>
> Can you offer suggestions or directions to solve these problems?
>
> Thank you in advance :)
>
> Best regards,
>
> /Jay Sung
>
I'm copying in our DLM experts. It would be good to open a bug at Red 
Hat's bugzilla to track this issue (and a customer case too, if you are 
a customer). It looks like something that will need some investigation 
to get to the bottom of what is going on. I suspect that a tcpdump of 
the DLM traffic when the issue occurs would be the first thing to try, 
so that we can try to match the message to the protocol dump. That may
not be easy, since I suspect that there is a large quantity of DLM
traffic in your setup, and that will make finding the specific messages
trickier.
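
One way to narrow the search might be to pull the timestamps of the
send_repeat_remove messages out of the kernel log first, so that only the
matching slices of the capture need inspecting. A minimal Python sketch
follows, assuming dmesg-style lines with a leading "[seconds.fraction]"
timestamp. The function name, the 10-second bucket size and the sample
lines are made up for illustration and are not taken from the reporter's
logs.

```python
import re
from collections import Counter

# Assumption: each line starts with a dmesg-style timestamp, e.g.
# "[12345.678901] dlm: gfs2: send_repeat_remove ..."
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def repeat_remove_buckets(lines, bucket_seconds=10):
    """Count send_repeat_remove messages per time bucket.

    Returns a dict mapping the bucket start (seconds since boot)
    to the number of matching messages seen in that bucket.
    """
    counts = Counter()
    for line in lines:
        if "send_repeat_remove" not in line:
            continue
        match = TS_RE.match(line)
        if match is None:
            continue  # no dmesg-style timestamp on this line, skip it
        timestamp = float(match.group(1))
        bucket = int(timestamp // bucket_seconds) * bucket_seconds
        counts[bucket] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical sample lines; the text after the function name is a
    # placeholder, not the real message format.
    sample = [
        "[  100.500000] dlm: gfs2: send_repeat_remove (details omitted)",
        "[  104.000000] dlm: gfs2: send_repeat_remove (details omitted)",
        "[  123.000000] dlm: gfs2: send_repeat_remove (details omitted)",
    ]
    for start, count in repeat_remove_buckets(sample).items():
        print(f"{start}s since boot: {count} message(s)")
```

The bucket starts are seconds since boot, so they still need converting
to wall-clock time before they can be compared with the capture.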

Just out of interest, what kind of network is this running over? How 
much bandwidth is DLM taking up?

Steve.
