<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head> <body> <p>On 12/7/22 08:39, <a class="moz-txt-link-abbreviated" href="mailto:hostalp@post.cz">hostalp@post.cz</a> wrote:<br> </p> <blockquote type="cite" cite="mid:79n.fbvh.4VU7KPD75wp.1Za9UW@seznam.cz"> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <div>Hello.</div> <div>log from the previous boot until the freeze attached.</div> <div>Speaking of suspend/resume I can see some in there.</div> <div><br> </div> <div>First there was a suspend of the original VDO-over-LVM device before the conversion to LVM-VDO via the lvm_import_vdo script - Dec 3 06:19:43<br> </div> <div>After the conversion the new device started (resumed) as a different one.</div> <div>Not sure if the issue you mention could also affect such cases.</div> </blockquote> <p>The conversion script should be completely shutting the device down and starting it up from scratch. The resume and suspend operations are used as part of startup and shutdown respectively, but it's specifically the suspend-then-resume sequence after having done any sort of write operations (or anything that would create journal entries) that can trigger the problem. The conversion script shouldn't do that.</p> <p>The bug is written up at <a class="moz-txt-link-freetext" href="https://bugzilla.redhat.com/show_bug.cgi?id=2109047">https://bugzilla.redhat.com/show_bug.cgi?id=2109047</a> and some other linked tickets.<br> </p> <blockquote type="cite" cite="mid:79n.fbvh.4VU7KPD75wp.1Za9UW@seznam.cz"> <div>Then there was another suspend/resume cycle (actually 2) related to the renaming of the VDO pool LV. 
First I set it to a different name (mostly to match the naming convention of newly created devices, as the conversion script uses a slightly different convention than lvcreate), then (after looking at it and thinking about it more thoroughly) I renamed it back (basically I chose my own name that was coincidentally identical to the original one generated by the conversion script) - Dec 3 06:40:44 - 06:43:37</div> </blockquote> Yes, this looks like it could well have done it.<br> <blockquote type="cite" cite="mid:79n.fbvh.4VU7KPD75wp.1Za9UW@seznam.cz"> <div><br> </div> <div>The freeze occurred the next day early in the morning - first symptoms visible in the attached log: Dec 4 01:53:01 - i.e. some 19 hours later.</div> <div><br> </div> <div>If the freeze was due to those earlier suspend/resume cycles (either related to the conversion to LVM-VDO, or to the later VDO pool LV renaming) - how should such situations be handled properly (without a restart)? Of course I didn't explicitly perform any suspend/resume myself there.<br> </div> </blockquote> Using "lvchange -an" on the logical volume stored in VDO (post conversion) should shut VDO down and clear out the incorrect data structures. (You can confirm that with "dmsetup table" -- the "vdo" entry should disappear.) Then running "lvchange -ay" to make the logical volume available again will start VDO with the internal data structures in a clean state.<br> <blockquote type="cite" cite="mid:79n.fbvh.4VU7KPD75wp.1Za9UW@seznam.cz"> <div><br> </div> <div>As for the symptoms, the VDO disk was completely frozen, not just slow.</div> <div>If it occurs again I'll collect some more information as you suggest.</div> <div><br> </div> <div>Best regards,</div> <div>Petr<br> </div> </blockquote> <p><br> </p> <p>Thanks. 
Hopefully, if you can avoid operations that involve suspends (we documented things like growing the storage, but renaming didn't occur to me), or have the opportunity to stop and restart the device soon afterwards, you shouldn't see it again...</p> <p>Ken<br> </p> <blockquote type="cite" cite="mid:79n.fbvh.4VU7KPD75wp.1Za9UW@seznam.cz"> <div><br> </div> <aside>---------- Original message ----------<br> From: Ken Raeburn <a class="moz-txt-link-rfc2396E" href="mailto:raeburn@redhat.com">&lt;raeburn@redhat.com&gt;</a><br> To: <a class="moz-txt-link-abbreviated" href="mailto:hostalp@post.cz">hostalp@post.cz</a>, <a class="moz-txt-link-abbreviated" href="mailto:vdo-devel@redhat.com">vdo-devel@redhat.com</a><br> Sent: 7. 12. 2022 5:06:30<br> Subject: Re: [vdo-devel] Rocky Linux 8.7 &amp; LVM-VDO stability?</aside> <br> <blockquote data-email="raeburn@redhat.com">Do you have the rest of the kernel log from that boot session? I'd be <br> curious to see what preceded the lockup. <br> <br> There is a known bug which can result in a lockup of the device, but it <br> occurs after the device has been suspended and resumed. That's different <br> from shutting it down completely and starting it up again, which is what <br> the conversion process does. We've got a fix for it in the RHEL (and <br> CentOS) 9 code streams, but for the RHEL 8 version the recommended <br> workaround is to fully stop and then restart the device as soon as <br> possible after a suspend/resume sequence. <br> <br> The suspend and resume don't have to be explicit on the part of the <br> user; they can happen implicitly as part of adding more physical storage <br> or changing some of the configuration parameters, as suspend/resume is <br> done as part of loading a new configuration into the kernel. So if you <br> made a configuration change after the upgrade, that could have tripped <br> the bug. <br> <br> If that wasn't it, maybe there's some other clue in the kernel log... 
<br> <br> If it should come up again, there are a few things to look at: <br> <br> - First, is it really frozen or just slow? The sar or iostat programs <br> can show you if I/O is happening. <br> <br> - Are any of the VDO threads using any CPU time? <br> <br> - Try running "dmsetup message &lt;vdo-name&gt; 0 dump all" where vdo-name is <br> the device name in /dev/mapper, perhaps something like <br> vdovg-vdolvol_vpool-vpool if you let the conversion script pick the <br> names. Sending this message to VDO will cause it to write a bunch of <br> info to the kernel log, which might give us some more insight into the <br> problem. <br> <br> Ken <br> <br> On 12/5/22 19:39, <a class="moz-txt-link-abbreviated" href="mailto:hostalp@post.cz">hostalp@post.cz</a> wrote: <br> > Hello, <br> > until recently I was running a Rocky Linux 8.5 VM (on the Proxmox 7 <br> > virtualization platform) with the following config: <br> > <br> > kernel-4.18.0-348.23.1.el8_5.x86_64 <br> > lvm2-2.03.12-11.el8_5.x86_64 <br> > vdo-6.2.5.74-14.el8.x86_64 <br> > kmod-kvdo-6.2.5.72-81.el8.x86_64 <br> > <br> > XFS > VDO > LVM > virtual disk (VirtIO SCSI) <br> > <br> > The VDO volume was created using the default config, brief summary: <br> > - logical size 1.2x physical size (based on our past tests on the <br> > stored data) <br> > - compression &amp; deduplication on <br> > - dense index <br> > - write mode async <br> > <br> > It was mounted using the following options: defaults,noatime,logbsize=128k <br> > With discards performed periodically via the fstrim.timer. <br> > <br> > This was stable for its entire uptime (that is, ever since the <br> > whole system was created). <br> > <br> > A few days ago I finally updated it to RL 8.7 as well as converted the <br> > "VDO on LVM" to the new LVM-VDO solution using the lvm_import_vdo <br> > script. The whole process went fine (I had already tested it beforehand) and I <br> > ended up with the system running in the desired config. 
<br> > <br> > kernel-4.18.0-425.3.1.el8.x86_64 <br> > lvm2-2.03.14-6.el8.x86_64 <br> > vdo-6.2.7.17-14.el8.x86_64 <br> > kmod-kvdo-6.2.7.17-87.el8.x86_64 <br> > <br> > The current disk space utilization is around 61% (pretty much the same <br> > for physical as well as for logical space) and it was never close to 80%. <br> > <br> > However it "lasted" for less than a day. During the following night <br> > all operations on the VDO volume hung (the other non-VDO volumes were <br> > still usable) and I had to perform a hard restart in order to get it <br> > back to work. <br> > <br> > The only errors/complaints that I found were the blocked task <br> > notifications in the console as well as in the /var/log/messages log <br> > with the following detail (only the 1st occurrence shown). <br> > <br> > Dec 4 01:53:01 lts1 kernel: INFO: task xfsaild/dm-4:5148 blocked for <br> > more than 120 seconds. <br> > Dec 4 01:53:01 lts1 kernel: Tainted: G OE --------- - <br> > - 4.18.0-425.3.1.el8.x86_64 #1 <br> > Dec 4 01:53:01 lts1 kernel: "echo 0 > <br> > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <br> > Dec 4 01:53:01 lts1 kernel: task:xfsaild/dm-4 state:D stack: 0 <br> > pid: 5148 ppid: 2 flags:0x80004080 <br> > Dec 4 01:53:01 lts1 kernel: Call Trace: <br> > Dec 4 01:53:01 lts1 kernel: __schedule+0x2d1/0x860 <br> > Dec 4 01:53:01 lts1 kernel: ? finish_wait+0x80/0x80 <br> > Dec 4 01:53:01 lts1 kernel: schedule+0x35/0xa0 <br> > Dec 4 01:53:01 lts1 kernel: io_schedule+0x12/0x40 <br> > Dec 4 01:53:01 lts1 kernel: limiterWaitForOneFree+0xc0/0xf0 [kvdo] <br> > Dec 4 01:53:01 lts1 kernel: ? 
finish_wait+0x80/0x80 <br> > Dec 4 01:53:01 lts1 kernel: kvdoMapBio+0xcc/0x2a0 [kvdo] <br> > Dec 4 01:53:01 lts1 kernel: __map_bio+0x47/0x1b0 [dm_mod] <br> > Dec 4 01:53:01 lts1 kernel: dm_make_request+0x1a9/0x4d0 [dm_mod] <br> > Dec 4 01:53:01 lts1 kernel: generic_make_request_no_check+0x202/0x330 <br> > Dec 4 01:53:01 lts1 kernel: submit_bio+0x3c/0x160 <br> > Dec 4 01:53:01 lts1 kernel: ? bio_add_page+0x46/0x60 <br> > Dec 4 01:53:01 lts1 kernel: _xfs_buf_ioapply+0x2af/0x430 [xfs] <br> > Dec 4 01:53:01 lts1 kernel: ? xfs_iextents_copy+0xba/0x170 [xfs] <br> > Dec 4 01:53:01 lts1 kernel: ? <br> > xfs_buf_delwri_submit_buffers+0x10c/0x2a0 [xfs] <br> > Dec 4 01:53:01 lts1 kernel: __xfs_buf_submit+0x63/0x1d0 [xfs] <br> > Dec 4 01:53:01 lts1 kernel: xfs_buf_delwri_submit_buffers+0x10c/0x2a0 <br> > [xfs] <br> > Dec 4 01:53:01 lts1 kernel: ? xfsaild+0x26f/0x8c0 [xfs] <br> > Dec 4 01:53:01 lts1 kernel: xfsaild+0x26f/0x8c0 [xfs] <br> > Dec 4 01:53:01 lts1 kernel: ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs] <br> > Dec 4 01:53:01 lts1 kernel: kthread+0x10b/0x130 <br> > Dec 4 01:53:01 lts1 kernel: ? set_kthread_struct+0x50/0x50 <br> > Dec 4 01:53:01 lts1 kernel: ret_from_fork+0x1f/0x40 <br> > <br> > I'm now awaiting another occurrence of this and wondering where the <br> > issue may be coming from. <br> > Could it be the new LVM-VDO solution, or the kernel itself? <br> > Can you perhaps suggest how to collect more information in such a case, <br> > or provide any other tips? <br> > <br> > Best regards, <br> > Petr <br> > <br> > _______________________________________________ <br> > vdo-devel mailing list <br> > <a class="moz-txt-link-abbreviated" href="mailto:vdo-devel@redhat.com">vdo-devel@redhat.com</a> <br> > <a class="moz-txt-link-freetext" href="https://listman.redhat.com/mailman/listinfo/vdo-devel">https://listman.redhat.com/mailman/listinfo/vdo-devel</a> <br> <br> </blockquote> </blockquote> </body> </html>