[PATCH 0/2] Work around broken clang. Again.

Michal Privoznik mprivozn at redhat.com
Wed Mar 15 19:10:26 UTC 2023


We ran into so many clang bugs recently, that I'm strongly in favor of
making it optional and aim on gcc only. Debugging these issues burns our
time needlessly. Just consider what's happening here.

I've pushed patches recently that call some virNuma*() APIs during QEMU
cmd line generation. And while at it, I removed virNuma*() mocks from
qemuxml2argvmock.c (v9.1.0-213-g95ae91fdd4). Andrea told me in review to
make sure our CI works. So I've fired up my freebsd VM and compiled my
code there. It worked. Except, I'm compiling with -O0 and our CI
compiles with the default (-O2). And of course this is the source of the
current build breakage.

Long story short, clang is so eager to produce the fastest code that it
simply ignores written code and generates what it *assumes* behaves the
same. Well, it doesn't and this breaks our mocking and thus our tests.
And of course it's happening only with optimizations.

In this particular case, I've noticed that while
virNumaNodesetIsAvailable() calls virNumaNodeIsAvailable(), it
disregards the return value and virReportError()-s immediately.
This is because the !WITH_NUMACTL variant of virNumaNodeIsAvailable()
calls virNumaGetMaxNode() which fails and thus false is returned.

WORSE, then I stepped in gdb into the mock, I've seen random numbers as
integers. This is because the function from virnumamock.c wasn't
optimized as much and followed standard calling convention (it grabbed
the integer argument from the stack). But the caller
(virNumaNodesetIsAvailable()) was so optimized that it did not even
bothered to push anything onto stack.

After these patches, there is still one qemuxml2argvtest case failing,
and it's simply because no matter how hard I try, I can't stop clang
from optimizing the function. Consider the following code:

  virNumaCPUSetToNodeset(virBitmap *cpuset,
                         virBitmap **nodeset)
  {
      g_autoptr(virBitmap) nodes = virBitmapNew(0);
      ssize_t pos = -1;

      while ((pos = virBitmapNextSetBit(cpuset, pos)) >= 0) {
          int node = virNumaGetNodeOfCPU(pos);

          if (node < 0) {
              virReportSystemError(errno,
                                   _("Unable to get NUMA node of cpu %zd"),
                                   pos);
              return -1;
          }
          ...
      }
      ...
  }

And this is the assembly code that clang generates (even after
VIR_OPTNONE treatment):


  116f5f:       e8 0c b7 f9 ff          call   b2670 <virBitmapNew at plt>
  116f64:       48 89 44 24 40          mov    %rax,0x40(%rsp)
  116f69:       48 c7 44 24 18 ff ff    movq   $0xffffffffffffffff,0x18(%rsp)
  116f70:       ff ff
  116f72:       48 8b 7c 24 38          mov    0x38(%rsp),%rdi
  116f77:       48 8b 74 24 18          mov    0x18(%rsp),%rsi
  116f7c:       e8 0f 74 f9 ff          call   ae390 <virBitmapNextSetBit at plt>
  116f81:       eb 00                   jmp    116f83 <virNumaCPUSetToNodeset+0x43>
  116f83:       48 89 44 24 18          mov    %rax,0x18(%rsp)
  116f88:       48 83 f8 00             cmp    $0x0,%rax
  116f8c:       0f 8c b2 00 00 00       jl     117044 <virNumaCPUSetToNodeset+0x104>
  116f92:       48 8b 7c 24 18          mov    0x18(%rsp),%rdi
  116f97:       e8 d4 a1 f9 ff          call   b1170 <virNumaGetNodeOfCPU at plt>
  116f9c:       c7 44 24 10 ff ff ff    movl   $0xffffffff,0x10(%rsp)
  116fa3:       ff
  116fa4:       83 7c 24 10 00          cmpl   $0x0,0x10(%rsp)
  116fa9:       7d 74                   jge    11701f <virNumaCPUSetToNodeset+0xdf>
  116fab:       e8 80 72 f9 ff          call   ae230 <__errno_location at plt>
  116fb0:       8b 18                   mov    (%rax),%ebx
  116fb2:       48 8d 3d 27 35 2b 00    lea    0x2b3527(%rip),%rdi        # 3ca4e0 <vmdk4GetBackingStore.prefix+0xf260>
  116fb9:       48 8d 35 8c e3 25 00    lea    0x25e38c(%rip),%rsi        # 37534c <modeMap+0x215c>
  116fc0:       ba 05 00 00 00          mov    $0x5,%edx
  116fc5:       e8 e6 a7 f9 ff          call   b17b0 <dcgettext at plt>
  116fca:       48 8b 4c 24 18          mov    0x18(%rsp),%rcx
  116fcf:       48 89 0c 24             mov    %rcx,(%rsp)
  116fd3:       48 8d 15 ea e0 25 00    lea    0x25e0ea(%rip),%rdx        # 3750c4 <modeMap+0x1ed4>
  116fda:       48 8d 0d 54 e3 25 00    lea    0x25e354(%rip),%rcx        # 375335 <modeMap+0x2145>
  116fe1:       31 ff                   xor    %edi,%edi
  116fe3:       89 de                   mov    %ebx,%esi
  116fe5:       41 b8 10 04 00 00       mov    $0x410,%r8d
  116feb:       49 89 c1                mov    %rax,%r9
  116fee:       31 c0                   xor    %eax,%eax
  116ff0:       e8 8b c1 f9 ff          call   b3180 <virReportSystemErrorFull at plt>


My assembler is a bit rusty, but at virNumaGetNodeOfCPU at plt is called
116f97, after which, -1 is stored on a stack (which corresponds to @pos
variable), after which comparison against 0 is made and the jump happens
iff the value (-1) is greater or equal than zero. Otherwise
virReportSystemError() is called.

To reproduce this issue even on Linux, configure with:

  export CC=clang
  meson setup -Dsystem=true -Dexpensive_tests=enabled -Dnumactl=disabled _build


At this point, I think we can call clang broken and focus our time on
developing libvirt. Not coming up with hacks around broken compilers.

Michal Prívozník (2):
  internal.h: Invent VIR_OPTNONE
  virnuma: Annotate some functions as VIR_OPTNONE

 src/internal.h     | 20 ++++++++++++++++++++
 src/util/virnuma.c | 20 ++++++++++----------
 2 files changed, 30 insertions(+), 10 deletions(-)

-- 
2.39.2



More information about the libvir-list mailing list