From bj.sung at sk.com Mon Nov 20 04:23:35 2017
From: bj.sung at sk.com (Baegjae Sung)
Date: Mon, 20 Nov 2017 04:23:35 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
Message-ID: 

Hello, List.

We are developing storage systems using 10 NVMes (current test set).
Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (max. on reads).

However, a GFS2 DLM problem occurred. The problem is that each host frequently
reports "dlm: gfs2: send_repeat_remove" kernel messages, and I/O throughput
becomes unstable and low.

I found a GFS2 commit message about the "send_repeat_remove" function:
(https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?id=96006ea6d4eea73466e90ef353bf34e507724e77)

Information about the test environment:

Four hosts share 10 NVMes, and each host deploys CLVM/GFS2 on top of the
cluster MD RAID1 + MD RAID0.

GFS2 has 2,000 directories, each with 1,900 media files (3 MB on average).

Each host runs 20 NGINX threads, and each thread randomly reads media files
on demand.

The Linux kernel version is 4.11.8.

Can you offer suggestions or directions to solve these problems?

Thank you in advance :)

Best regards,

/Jay Sung

Jay Sung (Baegjae), Manager | Software Defined Storage Lab | SK Telecom Co., LTD.
bj.sung at sk.com | mobile: +82-10-2087-5637


From swhiteho at redhat.com Mon Nov 20 10:40:54 2017
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 20 Nov 2017 10:40:54 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: 
References: 
Message-ID: <535dc07b-a780-29be-1f1e-077e568041e4@redhat.com>

Hi,

On 20/11/17 04:23, Baegjae Sung wrote:
>
> Hello, List.
>
> We are developing storage systems using 10 NVMes (current test set).
>
> Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (max. on
> reads).
>
> However, a GFS2 DLM problem occurred. The problem is that each host
> frequently reports "dlm: gfs2: send_repeat_remove" kernel messages,
> and I/O throughput becomes unstable and low.
>
> I found a GFS2 commit message about the "send_repeat_remove" function:
> (https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?id=96006ea6d4eea73466e90ef353bf34e507724e77)
>
> Information about the test environment:
>
> Four hosts share 10 NVMes, and each host deploys CLVM/GFS2 on top of
> the cluster MD RAID1 + MD RAID0.
>
> GFS2 has 2,000 directories, each with 1,900 media files (3 MB on average).
>
> Each host runs 20 NGINX threads, and each thread randomly reads
> media files on demand.
>
> The Linux kernel version is 4.11.8.
>
> Can you offer suggestions or directions to solve these problems?
>
> Thank you in advance :)
>
> Best regards,
>
> /Jay Sung

I'm copying in our DLM experts. It would be good to open a bug at Red Hat's
bugzilla to track this issue (and a customer case too, if you are a customer).
It looks like something that will need some investigation to get to the bottom
of what is going on.

I suspect that a tcpdump of the DLM traffic when the issue occurs would be the
first thing to try, so that we can try to match the message to the protocol
dump. That may not be easy, since I suspect that there is a large quantity of
DLM traffic in your setup, and that will make finding the specific messages
more tricky.

Just out of interest, what kind of network is this running over? How much
bandwidth is DLM taking up?

Steve.
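For reference, capturing the DLM traffic that Steve suggests could look roughly
like the sketch below. It assumes the cluster uses the DLM's default TCP port
(21064) and uses an example interface name; neither detail comes from this
thread, so adjust both to the actual configuration.

    # Capture DLM messages on the cluster interconnect.
    # eth1 is an example interface name; 21064 is the DLM's default TCP port,
    # so adjust it if dlm_controld is configured differently.
    tcpdump -i eth1 -s 0 -w dlm-traffic.pcap 'tcp port 21064'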
From teigland at redhat.com Mon Nov 20 19:09:32 2017
From: teigland at redhat.com (David Teigland)
Date: Mon, 20 Nov 2017 13:09:32 -0600
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: 
References: 
Message-ID: <20171120190932.GC29888@redhat.com>

> We are developing storage systems using 10 NVMes (current test set).
> Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (max. on reads).

Does MD RAID10 work correctly under GFS2? Does the RAID10 make use of the
recent md-cluster enhancements (which also use the dlm)?

> However, a GFS2 DLM problem occurred. The problem is that each host
> frequently reports "dlm: gfs2: send_repeat_remove" kernel messages,
> and I/O throughput becomes unstable and low.

send_repeat_remove is a mysterious corner case, related to the resource
directory becoming out of sync with the actual resource master. There's an
inherent race in this area of the dlm which is hard to solve, because the same
record (the mapping of resource name to master nodeid) needs to be changed
consistently on two nodes. Perhaps in the future the dlm could be enhanced
with some algorithm to do that better. For now, it just repeats the change
(logging the message you see). If the repeated operation works, then things
won't be permanently stuck.

The most likely cause, it seems to me, is that the speed of the storage
relative to the speed of the network is triggering pathological timing issues
in the dlm. Try adjusting the "toss_secs" tunable, which controls how long a
node will hold on to an unused resource before giving up mastery of it (the
master change is what leads to the inconsistency mentioned above):

echo 1000 > /sys/kernel/config/dlm/cluster/toss_secs

The default is 10; I'd try 100/1000/10000. A number too large could have the
negative consequence of not freeing dlm resources that will never be used
again, e.g. if you are deleting a lot of files. Set this number before
mounting gfs for it to take effect.

In the past, I think send_repeat_remove has tended to appear when there's a
huge volume of dlm messages, triggered by excessive caching done by gfs when
there's a large amount of system memory. The huge volume of dlm messages
results in the messages appearing in unusual sequences, reversing the usual
cause-effect.

Dave


From echang at sk.com Wed Nov 22 04:32:13 2017
From: echang at sk.com (Eric H. Chang)
Date: Wed, 22 Nov 2017 04:32:13 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
Message-ID: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD>

Hi Dave and Steven,

Thank you for the assistance. We made some progress here and would like to
share it with you.

#1. We've set "vm.vfs_cache_pressure" to zero and ran tests. As a result, we
didn't see the same problem happening, and observed that the slab grew slowly
and saturated at 25 GB during the overnight test. We will keep running tests
with this, but it would be appreciated if you could advise on any risks of
sticking with this config.

#2. We've tested with different "toss_secs" values as advised. When we
configured it as 1000, we saw the "send_repeat_remove" log after 1000 seconds.
We can test with other values of "toss_secs", but we think it would
potentially have the same problem whenever the slab is freed up after the
configured interval. Do our results make sense to you?

Best Regards,

Eric Chang (Hong-seok), Manager | Software Defined Storage Lab | SK Telecom Co., LTD.
echang at sk.com | mobile: +82-10-4996-3690 | skype: ehschang
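For reference, applying the two settings discussed above amounts to something
like the sketch below, run on each node. This is only a minimal illustration
of the suggestions in the thread, not a recommended production configuration;
the device path and mount point are placeholders, and the only grounded
details are the values mentioned by Dave and Eric and Dave's note that
toss_secs must be set before gfs2 is mounted.

    # Keep the VFS from aggressively reclaiming inode/dentry caches
    # (the setting Eric tested).
    sysctl -w vm.vfs_cache_pressure=0

    # Raise the DLM "toss" interval; Dave suggests trying 100/1000/10000,
    # and it only takes effect if set before the gfs2 mount.
    echo 1000 > /sys/kernel/config/dlm/cluster/toss_secs

    # Mount gfs2 afterwards; device and mount point are placeholders.
    mount -t gfs2 /dev/clustervg/gfs2lv /mnt/gfs2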
From teigland at redhat.com Wed Nov 22 15:03:32 2017
From: teigland at redhat.com (David Teigland)
Date: Wed, 22 Nov 2017 09:03:32 -0600
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD>
References: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD>
Message-ID: <20171122150332.GA24083@redhat.com>

On Wed, Nov 22, 2017 at 04:32:13AM +0000, Eric H. Chang wrote:
> We've tested with different "toss_secs" values as advised. When we
> configured it as 1000, we saw the "send_repeat_remove" log after
> 1000 seconds. We can test with other values of "toss_secs", but we
> think it would potentially have the same problem whenever the slab is
> freed up after the configured interval.

Do you see many of these messages? Do gfs operations become stuck after they
appear?
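As a rough way to gauge the first question above, the occurrences can simply
be counted in the kernel log; the one-liners below are illustrative and are
not taken from the thread.

    # Count occurrences in the kernel ring buffer (which may wrap on a busy system).
    dmesg | grep -c 'send_repeat_remove'

    # Or, if journald keeps kernel logs persistently:
    journalctl -k | grep -c 'send_repeat_remove'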
From echang at sk.com Thu Nov 23 05:36:02 2017
From: echang at sk.com (Eric H. Chang)
Date: Thu, 23 Nov 2017 05:36:02 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: <20171122150332.GA24083@redhat.com>
References: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD> <20171122150332.GA24083@redhat.com>
Message-ID: 

Hi Dave,

When the errors started to come out, the system got slower (performance
degraded) and lots of error messages showed up repeatedly. Specifically, when
a large amount of slab memory was reclaimed, such as from 9 GB down to 6 GB,
about 30 error messages came out. "send_repeat_remove" messages were printed
about 5 times intermittently as well. But the system didn't get stuck.

We are running the JMeter tool to simulate CDN workloads, and there are 2
million files (3 MB per file) in my storage that are read by 4 host servers.
160 Gbps of bandwidth was reached using 16 client servers with 10 Gb links
and 4 host servers with 40 Gb links that run GFS.

Hope this helps you understand my usage.

eric
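For anyone reproducing this, one rough way to watch the slab behaviour Eric
describes is to track overall slab usage and the gfs2/dlm related caches while
the test runs. The sketch below is only a starting point; exact slab cache
names vary by kernel version, so the grep pattern is an assumption.

    # Snapshot slab totals and any gfs2/dlm caches every 10 seconds (run as root).
    watch -n 10 "grep -E 'Slab|SReclaimable' /proc/meminfo; grep -i -E 'gfs2|dlm' /proc/slabinfo"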