From g.danti at assyoma.it Sat Aug 26 06:11:16 2017
From: g.danti at assyoma.it (Gionatan Danti)
Date: Sat, 26 Aug 2017 08:11:16 +0200
Subject: [Linux-cluster] GFS2 as virtual machine disk store
Message-ID: 

Hi list,
I am evaluating how to refresh my "standard" cluster configuration, and GFS2 is clearly on the table ;)

GOAL: to have a 2-node HA cluster running DRBD (active/active), GFS2 (to store the disk images) and KVM (as hypervisor). The cluster has to support live migration, but manual failover is sufficient (i.e. if something goes wrong, it is OK to require a sysadmin to take action to restore services).

The idea is to, by default, always run the VMs on the first host (using virtlockd or sanlock to prevent the same virtual machine from being started on the second host). Should anything bad happen, or should the first host be in maintenance mode, the VMs can be migrated/restarted on the second host.

I have a few questions:

- Other people have told me GFS2 is not well suited for such a task and that I am going to see much lower performance than running on a local filesystem (replicated via other means). This advice stems from the requirement to maintain not only proper write ordering, but also strict cache coherency between the hosts. However, from what I understand reading the GFS2 documentation, when operating mostly on a single host (i.e. not running anything on the second node), the overhead should be negligible. Am I right, or horribly wrong?

- Reading the Red Hat documentation here [1], I see that it is strongly advised to set cache=none for any virtual disk. Is this required for proper operation, or is it "only" a performance optimization to avoid what is stated above (i.e. two hosts sharing the same data in the page cache, thus requiring coherency traffic)? As I really like the improved performance of cache=writeback (which, by virtue of barrier passing, comes without data-loss concerns), do you think it is safe to use writeback in production?

- I plan to have a volume of about 8 or 16 TB. I understand that GFS2 is tested with much bigger volumes (i.e. 100 TB), but I have to ask: would you trust a multi-TB volume to GFS2? What about fsck? Does it work well/reliably?

- I plan to put GFS2 on top of LVM (for backup snapshots) and replicate the volume with DRBD. Do you see any drawback in this approach?

- Finally, how do you feel about running your production virtual machines on DRBD + GFS2?

Thank you all.

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Global_File_System_2/index.html#s1-VMsGFS2-gfs2

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
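[For the "prevent the same VM from starting on the second host" part of the plan above, libvirt's built-in lockd lock manager (virtlockd) can keep its lease files on the shared GFS2 volume so both hosts see the same locks. A minimal sketch, assuming a hypothetical mount point of /var/lib/libvirt/images/gfs2; sanlock would be configured along similar lines.]

  # /etc/libvirt/qemu.conf -- switch QEMU domains to the lockd lock manager
  lock_manager = "lockd"

  # /etc/libvirt/qemu-lockd.conf -- keep the lease files on the shared GFS2
  # volume so a second host cannot start a VM whose disks are already locked
  file_lockspace_dir = "/var/lib/libvirt/images/gfs2/lockspace"

  # on both nodes:
  #   systemctl enable --now virtlockd
  #   systemctl restart libvirtd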
From admin at feldhost.cz Sat Aug 26 09:34:25 2017
From: admin at feldhost.cz (Kristián Feldsam)
Date: Sat, 26 Aug 2017 11:34:25 +0200
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: 
References: 
Message-ID: <53EC3320-C084-440E-BD2A-49D7500A6A78@feldhost.cz>

Hello, according to the Red Hat documentation, "smaller is better". I personally use 1 TB volumes with a 256 MB journal.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Global_File_System_2/index.html#s1-formatting-gfs2

Best regards,
Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail: support at feldhost.cz

www.feldhost.cz - FeldHost™ - professional hosting and server services at reasonable prices.

FELDSAM s.r.o.
V rohu 434/3
Praha 4 - Libuš, 142 00
Company ID: 290 60 958, VAT ID: CZ290 60 958
Registered under file C 200350 at the Municipal Court in Prague

Bank: Fio banka a.s.
Account no.: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010 0000 0024 0033 0446

> On 26 Aug 2017, at 08:11, Gionatan Danti wrote:
> 
> Hi list,
> I am evaluating how to refresh my "standard" cluster configuration, and
> GFS2 is clearly on the table ;)
> 
> GOAL: to have a 2-node HA cluster running DRBD (active/active), GFS2
> (to store the disk images) and KVM (as hypervisor). The cluster has to
> support live migration, but manual failover is sufficient (i.e. if
> something goes wrong, it is OK to require a sysadmin to take action to
> restore services).
> 
> The idea is to, by default, always run the VMs on the first host (using
> virtlockd or sanlock to prevent the same virtual machine from being
> started on the second host). Should anything bad happen, or should the
> first host be in maintenance mode, the VMs can be migrated/restarted on
> the second host.
> 
> I have a few questions:
> 
> - Other people have told me GFS2 is not well suited for such a task and
> that I am going to see much lower performance than running on a local
> filesystem (replicated via other means). This advice stems from the
> requirement to maintain not only proper write ordering, but also strict
> cache coherency between the hosts. However, from what I understand
> reading the GFS2 documentation, when operating mostly on a single host
> (i.e. not running anything on the second node), the overhead should be
> negligible. Am I right, or horribly wrong?
> 
> - Reading the Red Hat documentation here [1], I see that it is strongly
> advised to set cache=none for any virtual disk. Is this required for
> proper operation, or is it "only" a performance optimization to avoid
> what is stated above (i.e. two hosts sharing the same data in the page
> cache, thus requiring coherency traffic)? As I really like the improved
> performance of cache=writeback (which, by virtue of barrier passing,
> comes without data-loss concerns), do you think it is safe to use
> writeback in production?
> 
> - I plan to have a volume of about 8 or 16 TB. I understand that GFS2
> is tested with much bigger volumes (i.e. 100 TB), but I have to ask:
> would you trust a multi-TB volume to GFS2? What about fsck? Does it
> work well/reliably?
> 
> - I plan to put GFS2 on top of LVM (for backup snapshots) and replicate
> the volume with DRBD. Do you see any drawback in this approach?
> 
> - Finally, how do you feel about running your production virtual
> machines on DRBD + GFS2?
> 
> Thank you all.
> 
> [1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Global_File_System_2/index.html#s1-VMsGFS2-gfs2
> 
> -- 
> Danti Gionatan
> Supporto Tecnico
> Assyoma S.r.l. - www.assyoma.it
> email: g.danti at assyoma.it - info at assyoma.it
> GPG public key ID: FF5F32A8
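[For concreteness, the kind of format command this advice translates to; the cluster name, filesystem name and device below are assumptions, and the journal count matches a two-node cluster.]

  # -p lock_dlm selects the cluster lock manager, -t <clustername>:<fsname>
  # must match the cluster's name, -j 2 creates one journal per node and
  # -J 256 sets 256 MB journals as suggested above
  mkfs.gfs2 -p lock_dlm -t vmcluster:vmstore -j 2 -J 256 /dev/drbd0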
From g.danti at assyoma.it Sat Aug 26 17:25:03 2017
From: g.danti at assyoma.it (Gionatan Danti)
Date: Sat, 26 Aug 2017 19:25:03 +0200
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <53EC3320-C084-440E-BD2A-49D7500A6A78@feldhost.cz>
References: <53EC3320-C084-440E-BD2A-49D7500A6A78@feldhost.cz>
Message-ID: <1fc8248ef2923176aaa4a6925810e886@assyoma.it>

On 26-08-2017 11:34, Kristián Feldsam wrote:
> Hello, according to the Red Hat documentation, "smaller is better". I
> personally use 1 TB volumes with a 256 MB journal.
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Global_File_System_2/index.html#s1-formatting-gfs2

Sure, but these are general recommendations, useful/valid for any filesystem. I would like to know whether they are *required* to extract good performance from GFS2 or not.

Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8

From swhiteho at redhat.com Tue Aug 29 09:45:44 2017
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Tue, 29 Aug 2017 10:45:44 +0100
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: 
References: 
Message-ID: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com>

Hi,

On 26/08/17 07:11, Gionatan Danti wrote:
> Hi list,
> I am evaluating how to refresh my "standard" cluster configuration, and
> GFS2 is clearly on the table ;)
> 
> GOAL: to have a 2-node HA cluster running DRBD (active/active), GFS2
> (to store the disk images) and KVM (as hypervisor). The cluster has to
> support live migration, but manual failover is sufficient (i.e. if
> something goes wrong, it is OK to require a sysadmin to take action to
> restore services).
> 
> The idea is to, by default, always run the VMs on the first host (using
> virtlockd or sanlock to prevent the same virtual machine from being
> started on the second host). Should anything bad happen, or should the
> first host be in maintenance mode, the VMs can be migrated/restarted on
> the second host.
> 
> I have a few questions:
> 
> - Other people have told me GFS2 is not well suited for such a task and
> that I am going to see much lower performance than running on a local
> filesystem (replicated via other means). This advice stems from the
> requirement to maintain not only proper write ordering, but also strict
> cache coherency between the hosts. However, from what I understand
> reading the GFS2 documentation, when operating mostly on a single host
> (i.e. not running anything on the second node), the overhead should be
> negligible. Am I right, or horribly wrong?
> 
Yes, there is some additional overhead due to the clustering. You can, however, usually organise things so that the overheads are minimised, as you mentioned above, by being careful about the workload.

> - Reading the Red Hat documentation here [1], I see that it is strongly
> advised to set cache=none for any virtual disk. Is this required for
> proper operation, or is it "only" a performance optimization to avoid
> what is stated above (i.e. two hosts sharing the same data in the page
> cache, thus requiring coherency traffic)? As I really like the improved
> performance of cache=writeback (which, by virtue of barrier passing,
> comes without data-loss concerns), do you think it is safe to use
> writeback in production?

No. You want to use the default data=ordered for the most part. It is less a question of data loss and more a question of whether, in case of a power outage, it is possible for a file that is being written to end up with incorrect content. That can happen in the data=writeback case (where block allocation has succeeded, but the new data has not yet been written to disk) but not in the data=ordered case.

> - I plan to have a volume of about 8 or 16 TB. I understand that GFS2
> is tested with much bigger volumes (i.e. 100 TB), but I have to ask:
> would you trust a multi-TB volume to GFS2? What about fsck? Does it
> work well/reliably?

Yes, it works well. The size limit was based on fsck time, rather than any reliability issues. It will work reliably at much larger sizes, but fsck will take longer and use more memory.

I hope that answers a few more of your questions,

Steve.

> 
> - I plan to put GFS2 on top of LVM (for backup snapshots) and replicate
> the volume with DRBD. Do you see any drawback in this approach?
> 
> - Finally, how do you feel about running your production virtual
> machines on DRBD + GFS2?
> 
> Thank you all.
> 
> [1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Global_File_System_2/index.html#s1-VMsGFS2-gfs2
> 
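[To make the mount-option side of this concrete: data=ordered is already the GFS2 default, so nothing special is needed for it, while noatime/nodiratime are commonly added to avoid taking cluster locks just to update access times. A sketch of an fstab line under the device and mount-point assumptions used earlier; in a Pacemaker-managed cluster the mount would more likely be a Filesystem resource than an fstab entry.]

  # /etc/fstab -- data=ordered is the default journaling mode for GFS2;
  # noatime/nodiratime avoid cluster lock traffic for pure access-time updates
  /dev/drbd0  /var/lib/libvirt/images/gfs2  gfs2  noatime,nodiratime  0 0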
From g.danti at assyoma.it Tue Aug 29 10:54:28 2017
From: g.danti at assyoma.it (Gionatan Danti)
Date: Tue, 29 Aug 2017 12:54:28 +0200
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com>
References: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com>
Message-ID: <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it>

Hi Steven,

On 29-08-2017 11:45, Steven Whitehouse wrote:
> Yes, there is some additional overhead due to the clustering. You can,
> however, usually organise things so that the overheads are minimised,
> as you mentioned above, by being careful about the workload.
> 
> No. You want to use the default data=ordered for the most part. It is
> less a question of data loss and more a question of whether, in case of
> a power outage, it is possible for a file that is being written to end
> up with incorrect content. That can happen in the data=writeback case
> (where block allocation has succeeded, but the new data has not yet
> been written to disk) but not in the data=ordered case.

I think there is a misunderstanding: I am not speaking about the filesystem mount options (data=ordered vs data=writeback), but rather about the QEMU virtual disk caching mode: the Red Hat documentation suggests setting QEMU virtual disks to cache=none. However, cache=writeback has some significant performance advantages in a number of situations. Since, for at least 5 years now, QEMU with cache=writeback has supported barrier passing and so is safe to use, I wondered why Red Hat officially suggests avoiding it on GFS2. I suspect it is related to the performance degradation caused by cache coherency traffic between the two hosts, but I would like to be certain it is not related to inherently unsafe operation on GFS2.

> Yes, it works well. The size limit was based on fsck time, rather than
> any reliability issues. It will work reliably at much larger sizes,
> but fsck will take longer and use more memory.

Great. Any advice on how much time is needed for a full fsck on an 8+ TB volume?

> I hope that answers a few more of your questions,
> 
> Steve.

Absolutely great info. Thank you very much Steve.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
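[For reference, the setting under discussion is the per-disk cache attribute in the libvirt domain XML; a hedged example with a hypothetical image path.]

  <disk type='file' device='disk'>
    <!-- cache='none' makes QEMU open the image with O_DIRECT, so no
         host-local page-cache copy exists that could go stale when the
         guest runs on, or migrates to, the other node -->
    <driver name='qemu' type='qcow2' cache='none' io='native'/>
    <source file='/var/lib/libvirt/images/gfs2/vm1.qcow2'/>
    <target dev='vda' bus='virtio'/>
  </disk>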
From swhiteho at redhat.com Tue Aug 29 10:59:22 2017
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Tue, 29 Aug 2017 11:59:22 +0100
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it>
References: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com> <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it>
Message-ID: <9f874ecb-d1f4-9ac7-5644-969cf0e16cb5@redhat.com>

On 29/08/17 11:54, Gionatan Danti wrote:
> Hi Steven,
> 
> On 29-08-2017 11:45, Steven Whitehouse wrote:
>> Yes, there is some additional overhead due to the clustering. You can,
>> however, usually organise things so that the overheads are minimised,
>> as you mentioned above, by being careful about the workload.
>> 
>> No. You want to use the default data=ordered for the most part. It is
>> less a question of data loss and more a question of whether, in case
>> of a power outage, it is possible for a file that is being written to
>> end up with incorrect content. That can happen in the data=writeback
>> case (where block allocation has succeeded, but the new data has not
>> yet been written to disk) but not in the data=ordered case.
> 
> I think there is a misunderstanding: I am not speaking about the
> filesystem mount options (data=ordered vs data=writeback), but rather
> about the QEMU virtual disk caching mode: the Red Hat documentation
> suggests setting QEMU virtual disks to cache=none. However,
> cache=writeback has some significant performance advantages in a number
> of situations. Since, for at least 5 years now, QEMU with
> cache=writeback has supported barrier passing and so is safe to use, I
> wondered why Red Hat officially suggests avoiding it on GFS2. I suspect
> it is related to the performance degradation caused by cache coherency
> traffic between the two hosts, but I would like to be certain it is not
> related to inherently unsafe operation on GFS2.
> 
Yes, it definitely needs to be set to cache=none mode. Barrier passing is only one issue, and as you say it is down to the cache coherency, since the block layer is not aware of the caching requirements of the upper layers in this case.

>> Yes, it works well. The size limit was based on fsck time, rather than
>> any reliability issues. It will work reliably at much larger sizes,
>> but fsck will take longer and use more memory.
> 
> Great. Any advice on how much time is needed for a full fsck on an
> 8+ TB volume?

It will depend a great deal on a number of factors... the performance of the storage and also the number of inodes on the filesystem. It will also take longer if there is any work to do (i.e. if changes need to be made, compared with just checking an otherwise clean filesystem), so it is difficult to give any guidance without knowing those variables. The best way to know is to try it and see,

Steve.
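[One possible way to get that first-hand number, sketched under the device assumption from earlier; fsck.gfs2 must only be run with the filesystem unmounted on every node.]

  # during a maintenance window, with the filesystem unmounted on both nodes:
  # -n opens the filesystem read-only and answers "no" to every repair, so
  # this run only measures how long a full check takes and how much memory
  # it needs, without changing anything on disk
  time fsck.gfs2 -n /dev/drbd0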
From g.danti at assyoma.it Tue Aug 29 11:07:30 2017
From: g.danti at assyoma.it (Gionatan Danti)
Date: Tue, 29 Aug 2017 13:07:30 +0200
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <9f874ecb-d1f4-9ac7-5644-969cf0e16cb5@redhat.com>
References: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com> <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it> <9f874ecb-d1f4-9ac7-5644-969cf0e16cb5@redhat.com>
Message-ID: <165c8dc8512b5c0781d3ecbba8405f39@assyoma.it>

On 29-08-2017 12:59, Steven Whitehouse wrote:
> Yes, it definitely needs to be set to cache=none mode. Barrier passing
> is only one issue, and as you say it is down to the cache coherency,
> since the block layer is not aware of the caching requirements of the
> upper layers in this case.

OK. Sorry to be pedantic, but I want to be sure I grasp it: using cache=writeback will *not* cause data corruption, but rather lower performance if/when the cache coherency protocol is needed (i.e. when live-migrating a VM from the first to the second host), correct?

> It will depend a great deal on a number of factors... the performance
> of the storage and also the number of inodes on the filesystem. It
> will also take longer if there is any work to do (i.e. if changes need
> to be made, compared with just checking an otherwise clean filesystem),
> so it is difficult to give any guidance without knowing those
> variables. The best way to know is to try it and see,
> 
> Steve.

OK, so some first-hand testing will be needed. Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8

From swhiteho at redhat.com Tue Aug 29 11:13:02 2017
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Tue, 29 Aug 2017 12:13:02 +0100
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <165c8dc8512b5c0781d3ecbba8405f39@assyoma.it>
References: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com> <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it> <9f874ecb-d1f4-9ac7-5644-969cf0e16cb5@redhat.com> <165c8dc8512b5c0781d3ecbba8405f39@assyoma.it>
Message-ID: <6cf6145c-46fc-599c-bacc-dcfbc7a1f23c@redhat.com>

Hi,

On 29/08/17 12:07, Gionatan Danti wrote:
> On 29-08-2017 12:59, Steven Whitehouse wrote:
>> Yes, it definitely needs to be set to cache=none mode. Barrier passing
>> is only one issue, and as you say it is down to the cache coherency,
>> since the block layer is not aware of the caching requirements of the
>> upper layers in this case.
> 
> OK. Sorry to be pedantic, but I want to be sure I grasp it: using
> cache=writeback will *not* cause data corruption, but rather lower
> performance if/when the cache coherency protocol is needed (i.e. when
> live-migrating a VM from the first to the second host), correct?

Whatever kind of storage is being used with GFS2, it needs to act as if there were no cache, or as if there were a common cache shared between all nodes - what we want to avoid is caches which are specific to each node. Using individual node caching will still cause issues if, for example, one node has cached a block that another node has since changed. In that case the node with the cached information will use that, rather than rereading from disk, which is where the newly changed information is. So it is a question of ensuring that all nodes "see" the same data at all times,

Steve
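[libvirt's own migration checks reflect the same rule: a live migration is refused with an "Unsafe migration" error when a disk uses a cache mode other than none, unless the check is explicitly overridden. A sketch with hypothetical domain and host names.]

  # with cache='none' this live migration proceeds; with cache='writeback'
  # libvirt refuses with an "Unsafe migration" error unless --unsafe is
  # added, which is exactly the node-local-cache risk described above
  virsh migrate --live --persistent vm1 qemu+ssh://node2/system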
From g.danti at assyoma.it Tue Aug 29 11:26:05 2017
From: g.danti at assyoma.it (Gionatan Danti)
Date: Tue, 29 Aug 2017 13:26:05 +0200
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <6cf6145c-46fc-599c-bacc-dcfbc7a1f23c@redhat.com>
References: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com> <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it> <9f874ecb-d1f4-9ac7-5644-969cf0e16cb5@redhat.com> <165c8dc8512b5c0781d3ecbba8405f39@assyoma.it> <6cf6145c-46fc-599c-bacc-dcfbc7a1f23c@redhat.com>
Message-ID: <4bdf875ba23c20e13706a5cb6774d39f@assyoma.it>

On 29-08-2017 13:13, Steven Whitehouse wrote:
> Whatever kind of storage is being used with GFS2, it needs to act as
> if there were no cache, or as if there were a common cache shared
> between all nodes - what we want to avoid is caches which are specific
> to each node. Using individual node caching will still cause issues
> if, for example, one node has cached a block that another node has
> since changed. In that case the node with the cached information will
> use that, rather than rereading from disk, which is where the newly
> changed information is. So it is a question of ensuring that all nodes
> "see" the same data at all times,
> 
> Steve

From my understanding (and I may be wrong...), GFS2 will itself take care of cache coherency between the hosts.

For example, if:
- node A reads a file;
- node B reads and writes the same file;
- node A re-reads the same file;
GFS2 should be able to guarantee a consistent view of the file on both nodes. Obviously this comes at a price: significant overhead when reading/writing the same (cached) files.

Am I missing something?
Thanks.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8

From swhiteho at redhat.com Tue Aug 29 11:28:58 2017
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Tue, 29 Aug 2017 12:28:58 +0100
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <4bdf875ba23c20e13706a5cb6774d39f@assyoma.it>
References: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com> <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it> <9f874ecb-d1f4-9ac7-5644-969cf0e16cb5@redhat.com> <165c8dc8512b5c0781d3ecbba8405f39@assyoma.it> <6cf6145c-46fc-599c-bacc-dcfbc7a1f23c@redhat.com> <4bdf875ba23c20e13706a5cb6774d39f@assyoma.it>
Message-ID: <0f5168f4-4d75-4dbf-ed65-cbd83e498e6e@redhat.com>

Hi,

On 29/08/17 12:26, Gionatan Danti wrote:
> On 29-08-2017 13:13, Steven Whitehouse wrote:
>> Whatever kind of storage is being used with GFS2, it needs to act as
>> if there were no cache, or as if there were a common cache shared
>> between all nodes - what we want to avoid is caches which are specific
>> to each node. Using individual node caching will still cause issues
>> if, for example, one node has cached a block that another node has
>> since changed. In that case the node with the cached information will
>> use that, rather than rereading from disk, which is where the newly
>> changed information is. So it is a question of ensuring that all nodes
>> "see" the same data at all times,
>> 
>> Steve
> 
> From my understanding (and I may be wrong...), GFS2 will itself take
> care of cache coherency between the hosts.
> 
> For example, if:
> - node A reads a file;
> - node B reads and writes the same file;
> - node A re-reads the same file;
> GFS2 should be able to guarantee a consistent view of the file on both
> nodes. Obviously this comes at a price: significant overhead when
> reading/writing the same (cached) files.
> 
> Am I missing something?
> Thanks.
> 
There is no significant overhead when reading the same file on multiple nodes. The overhead mostly applies when writes are involved in some form, whether mixed with other writes or with reads. GFS2 does ensure cache coherency, but in order to do that it requires certain properties from the storage, and hence the requirement for a symmetric view of the storage,

Steve.
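[If it helps when the time comes to measure that write-side overhead, GFS2 exposes its cluster locks ("glocks") through debugfs, and gfs2-utils ships a glocktop tool that summarises contention. A rough sketch, assuming the cluster and filesystem names used in the earlier mkfs example.]

  # raw per-filesystem glock dump (requires debugfs mounted on /sys/kernel/debug);
  # glocks with queued waiters point at files both nodes are contending for
  head -n 40 /sys/kernel/debug/gfs2/vmcluster:vmstore/glocks

  # gfs2-utils also provides glocktop for a periodically refreshed summary
  glocktop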
From g.danti at assyoma.it Tue Aug 29 11:32:42 2017
From: g.danti at assyoma.it (Gionatan Danti)
Date: Tue, 29 Aug 2017 13:32:42 +0200
Subject: [Linux-cluster] GFS2 as virtual machine disk store
In-Reply-To: <0f5168f4-4d75-4dbf-ed65-cbd83e498e6e@redhat.com>
References: <06898de7-c545-ba5d-3f70-1dece94b2a4d@redhat.com> <01bdb3a132a19fe13a240249c5b5bff7@assyoma.it> <9f874ecb-d1f4-9ac7-5644-969cf0e16cb5@redhat.com> <165c8dc8512b5c0781d3ecbba8405f39@assyoma.it> <6cf6145c-46fc-599c-bacc-dcfbc7a1f23c@redhat.com> <4bdf875ba23c20e13706a5cb6774d39f@assyoma.it> <0f5168f4-4d75-4dbf-ed65-cbd83e498e6e@redhat.com>
Message-ID: 

On 29-08-2017 13:28, Steven Whitehouse wrote:
> There is no significant overhead when reading the same file on
> multiple nodes. The overhead mostly applies when writes are involved
> in some form, whether mixed with other writes or with reads. GFS2 does
> ensure cache coherency, but in order to do that it requires certain
> properties from the storage, and hence the requirement for a symmetric
> view of the storage,
> 
> Steve.

OK. As a note, I plan to use DRBD in an active/active setup as the storage backend.

Thank you Steven.

-- 
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti at assyoma.it - info at assyoma.it
GPG public key ID: FF5F32A8
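[To round the thread off, the DRBD side of that plan would look roughly like the resource below: a dual-primary (active/active) resource needs protocol C, allow-two-primaries and, because GFS2 depends on fencing, the resource-and-stonith fencing policy. Node names, backing LV and addresses are placeholders, and the option set should be checked against the DRBD version in use.]

  # /etc/drbd.d/vmstore.res -- hypothetical dual-primary resource (DRBD 8.4 syntax)
  resource vmstore {
      net {
          protocol C;                   # synchronous replication, mandatory for dual-primary
          allow-two-primaries yes;      # both nodes promoted, as GFS2 requires
          after-sb-0pri discard-zero-changes;
          after-sb-1pri discard-secondary;
          after-sb-2pri disconnect;
      }
      disk {
          fencing resource-and-stonith; # tie DRBD fencing to the cluster's stonith
      }
      on node1 {
          device    /dev/drbd0;
          disk      /dev/vg0/vmstore;
          address   192.0.2.1:7789;
          meta-disk internal;
      }
      on node2 {
          device    /dev/drbd0;
          disk      /dev/vg0/vmstore;
          address   192.0.2.2:7789;
          meta-disk internal;
      }
  }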