From bj.sung at sk.com Mon Nov 20 04:23:35 2017
From: bj.sung at sk.com (Baegjae Sung)
Date: Mon, 20 Nov 2017 04:23:35 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
Message-ID: 

Hello, List.

We are developing storage systems using 10 NVMes (current test set).
Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (max. on reads).

However, a GFS2 DLM problem occurred. The problem is that each host frequently
reports "dlm: gfs2: send_repeat_remove" kernel messages, and I/O throughput
becomes unstable and low.

I found a GFS2 commit message about the "send_repeat_remove" function:
(https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?id=96006ea6d4eea73466e90ef353bf34e507724e77)

Information about the test environment:

Four hosts share 10 NVMes, and each host deploys CLVM/GFS2 on top of the
cluster MD RAID1 + MD RAID0.

GFS2 has 2,000 directories, each with 1,900 media files (3 MB on average).

Each host runs 20 NGINX threads, and each thread randomly reads media files
on demand.

The Linux kernel version is 4.11.8.

Can you offer suggestions or directions to solve these problems?

Thank you in advance :)

Best regards,

/Jay Sung

Jay Sung (Baegjae), Manager | Software Defined Storage Lab | SK Telecom Co., LTD.
bj.sung at sk.com | mobile: +82-10-2087-5637


From swhiteho at redhat.com Mon Nov 20 10:40:54 2017
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 20 Nov 2017 10:40:54 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: 
References: 
Message-ID: <535dc07b-a780-29be-1f1e-077e568041e4@redhat.com>

Hi,

On 20/11/17 04:23, Baegjae Sung wrote:
>
> Hello, List.
>
> We are developing storage systems using 10 NVMes (current test set).
>
> Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (max. on
> reads).
>
> However, a GFS2 DLM problem occurred. The problem is that each host
> frequently reports "dlm: gfs2: send_repeat_remove" kernel messages,
> and I/O throughput becomes unstable and low.
>
> I found a GFS2 commit message about the "send_repeat_remove" function:
> (https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?id=96006ea6d4eea73466e90ef353bf34e507724e77)
>
> Information about the test environment:
>
> Four hosts share 10 NVMes, and each host deploys CLVM/GFS2 on top of
> the cluster MD RAID1 + MD RAID0.
>
> GFS2 has 2,000 directories, each with 1,900 media files (3 MB on average).
>
> Each host runs 20 NGINX threads, and each thread randomly reads
> media files on demand.
>
> The Linux kernel version is 4.11.8.
>
> Can you offer suggestions or directions to solve these problems?
>
> Thank you in advance :)
>
> Best regards,
>
> /Jay Sung

I'm copying in our DLM experts. It would be good to open a bug at Red Hat's
bugzilla to track this issue (and a customer case too, if you are a customer).
It looks like something that will need some investigation to get to the bottom
of what is going on.

I suspect that a tcpdump of the DLM traffic when the issue occurs would be the
first thing to try, so that we can try to match the message to the protocol
dump. That may not be easy, since I suspect that there is a large quantity of
DLM traffic in your setup, and that will make finding the specific messages
more tricky.

Just out of interest, what kind of network is this running over? How much
bandwidth is DLM taking up?

Steve.
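For reference, capturing the DLM traffic that Steve suggests could look roughly
like the sketch below. It assumes the cluster uses the DLM's default TCP port
(21064) and uses an example interface name; neither detail comes from this
thread, so adjust both to the actual configuration.

    # Capture DLM messages on the cluster interconnect.
    # eth1 is an example interface name; 21064 is the DLM's default TCP port,
    # so adjust it if dlm_controld is configured differently.
    tcpdump -i eth1 -s 0 -w dlm-traffic.pcap 'tcp port 21064'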
From teigland at redhat.com Mon Nov 20 19:09:32 2017
From: teigland at redhat.com (David Teigland)
Date: Mon, 20 Nov 2017 13:09:32 -0600
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: 
References: 
Message-ID: <20171120190932.GC29888@redhat.com>

> We are developing storage systems using 10 NVMes (current test set).
> Using MD RAID10 + CLVM/GFS2 over four hosts achieves 22 GB/s (max. on reads).

Does MD RAID10 work correctly under GFS2? Does the RAID10 make use of the
recent md-cluster enhancements (which also use the dlm)?

> However, a GFS2 DLM problem occurred. The problem is that each host
> frequently reports "dlm: gfs2: send_repeat_remove" kernel messages,
> and I/O throughput becomes unstable and low.

send_repeat_remove is a mysterious corner case, related to the resource
directory becoming out of sync with the actual resource master. There's an
inherent race in this area of the dlm which is hard to solve, because the same
record (the mapping of resource name to master nodeid) needs to be changed
consistently on two nodes. Perhaps in the future the dlm could be enhanced
with some algorithm to do that better. For now, it just repeats the change
(logging the message you see). If the repeated operation works, then things
won't be permanently stuck.

The most likely cause, it seems to me, is that the speed of the storage
relative to the speed of the network is triggering pathological timing issues
in the dlm. Try adjusting the "toss_secs" tunable, which controls how long a
node will hold on to an unused resource before giving up mastery of it (the
master change is what leads to the inconsistency mentioned above):

echo 1000 > /sys/kernel/config/dlm/cluster/toss_secs

The default is 10; I'd try 100/1000/10000. A number too large could have the
negative consequence of not freeing dlm resources that will never be used
again, e.g. if you are deleting a lot of files. Set this number before
mounting gfs for it to take effect.

In the past, I think send_repeat_remove has tended to appear when there's a
huge volume of dlm messages, triggered by excessive caching done by gfs when
there's a large amount of system memory. The huge volume of dlm messages
results in the messages appearing in unusual sequences, reversing the usual
cause-effect.

Dave


From echang at sk.com Wed Nov 22 04:32:13 2017
From: echang at sk.com (Eric H. Chang)
Date: Wed, 22 Nov 2017 04:32:13 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
Message-ID: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD>

Hi Dave and Steven,

Thank you for the assistance. We made some progress here and would like to
share it with you.

#1. We've set "vm.vfs_cache_pressure" to zero and ran tests. As a result, we
didn't see the same problem happening, and observed that the slab grew slowly
and saturated at 25 GB during the overnight test. We will keep running tests
with this, but it would be appreciated if you could advise on any risks of
sticking with this config.

#2. We've tested with different "toss_secs" values as advised. When we
configured it as 1000, we saw the "send_repeat_remove" log after 1000 seconds.
We can test with other values of "toss_secs", but we think it would
potentially have the same problem whenever the slab is freed up after the
configured interval. Do our results make sense to you?

Best Regards,

Eric Chang (Hong-seok), Manager | Software Defined Storage Lab | SK Telecom Co., LTD.
echang at sk.com | mobile: +82-10-4996-3690 | skype: ehschang
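For reference, applying the two settings discussed above amounts to something
like the sketch below, run on each node. This is only a minimal illustration
of the suggestions in the thread, not a recommended production configuration;
the device path and mount point are placeholders, and the only grounded
details are the values mentioned by Dave and Eric and Dave's note that
toss_secs must be set before gfs2 is mounted.

    # Keep the VFS from aggressively reclaiming inode/dentry caches
    # (the setting Eric tested).
    sysctl -w vm.vfs_cache_pressure=0

    # Raise the DLM "toss" interval; Dave suggests trying 100/1000/10000,
    # and it only takes effect if set before the gfs2 mount.
    echo 1000 > /sys/kernel/config/dlm/cluster/toss_secs

    # Mount gfs2 afterwards; device and mount point are placeholders.
    mount -t gfs2 /dev/clustervg/gfs2lv /mnt/gfs2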
From teigland at redhat.com Wed Nov 22 15:03:32 2017
From: teigland at redhat.com (David Teigland)
Date: Wed, 22 Nov 2017 09:03:32 -0600
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD>
References: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD>
Message-ID: <20171122150332.GA24083@redhat.com>

On Wed, Nov 22, 2017 at 04:32:13AM +0000, Eric H. Chang wrote:
> We've tested with different "toss_secs" values as advised. When we
> configured it as 1000, we saw the "send_repeat_remove" log after
> 1000 seconds. We can test with other values of "toss_secs", but we
> think it would potentially have the same problem whenever the slab is
> freed up after the configured interval.

Do you see many of these messages? Do gfs operations become stuck after they
appear?
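As a rough way to gauge the first question above, the occurrences can simply
be counted in the kernel log; the one-liners below are illustrative and are
not taken from the thread.

    # Count occurrences in the kernel ring buffer (which may wrap on a busy system).
    dmesg | grep -c 'send_repeat_remove'

    # Or, if journald keeps kernel logs persistently:
    journalctl -k | grep -c 'send_repeat_remove'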
From echang at sk.com Thu Nov 23 05:36:02 2017
From: echang at sk.com (Eric H. Chang)
Date: Thu, 23 Nov 2017 05:36:02 +0000
Subject: [Linux-cluster] GFS2 DLM problem on NVMes
In-Reply-To: <20171122150332.GA24083@redhat.com>
References: <215748c05f8545deb3e498cd87ebe8c5@skt-tnetpmx2.SKT.AD> <20171122150332.GA24083@redhat.com>
Message-ID: 

Hi Dave,

When the errors started to come out, the system got slower (performance
degraded) and lots of error messages showed up repeatedly. Specifically, when
a large amount of slab memory was reclaimed, such as from 9 GB down to 6 GB,
about 30 error messages came out. "send_repeat_remove" messages were printed
about 5 times intermittently as well. But the system didn't get stuck.

We are running the JMeter tool to simulate CDN workloads, and there are 2
million files (3 MB per file) in my storage that are read by 4 host servers.
160 Gbps of bandwidth was reached using 16 client servers with 10 Gb links
and 4 host servers with 40 Gb links that run GFS.

Hope this helps you understand my usage.

eric
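For anyone reproducing this, one rough way to watch the slab behaviour Eric
describes is to track overall slab usage and the gfs2/dlm related caches while
the test runs. The sketch below is only a starting point; exact slab cache
names vary by kernel version, so the grep pattern is an assumption.

    # Snapshot slab totals and any gfs2/dlm caches every 10 seconds (run as root).
    watch -n 10 "grep -E 'Slab|SReclaimable' /proc/meminfo; grep -i -E 'gfs2|dlm' /proc/slabinfo"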