[PATCH 6/7] util: Check for pkttyagent availability properly

Jim Fehlig jfehlig at suse.com
Mon Dec 13 16:42:04 UTC 2021


On 12/12/21 14:15, Martin Kletzander wrote:
> On Sun, Dec 12, 2021 at 10:40:46AM -0700, Jim Fehlig wrote:
>> On 12/11/21 03:28, Martin Kletzander wrote:
>>> On Sat, Dec 11, 2021 at 11:16:13AM +0100, Martin Kletzander wrote:
>>>> On Fri, Dec 10, 2021 at 05:48:03PM -0700, Jim Fehlig wrote:
>>>>> Hi Martin!
>>>>>
>>>>> I recently received a bug report (sorry, not public) about simple operations
>>>>> like 'virsh list' hanging when invoked with an internal test tool. I found this
>>>>> commit to be the culprit.
>>>>>
>>>
>>> OK, one more thing though, the fact that pkttyagent is spawned cannot
>>> cause virsh to hang.  If authentication is not required, then it
>>> will just wait there for a while and then be killed.  If authentication
>>> *is* required, then either you already have an agent running and that
>>> one should be used since we're starting pkttyagent with `--fallback`, or
>>> you do not have any agent running, in which case virsh list would fail
>>> to connect.  Where does virsh hang, and what's the backtrace?
>>
>> The last scenario you describe appears to be the case. virsh fails to connect
>> and then gets stuck trying to kill off pkttyagent:
>>
>> #0  0x00007f9f07530241 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
>> #1  0x00007f9f07535ad3 in nanosleep () from /lib64/libc.so.6
>> #2  0x00007f9f07f478af in g_usleep () from /usr/lib64/libglib-2.0.so.0
>> #3  0x00007f9f086694fa in virProcessAbort (pid=367) at ../src/util/virprocess.c:187
>> #4  0x00007f9f0861ed9b in virCommandAbort (cmd=cmd@entry=0x55a798660c50) at ../src/util/vircommand.c:2774
>> #5  0x00007f9f08621478 in virCommandFree (cmd=0x55a798660c50) at ../src/util/vircommand.c:3061
>> #6  0x00007f9f08668581 in virPolkitAgentDestroy (agent=0x55a7986426e0) at ../src/util/virpolkit.c:164
>> #7  0x000055a797836d93 in virshConnect (ctl=ctl@entry=0x7ffc551dd980, uri=0x0, readonly=readonly@entry=false) at ../tools/virsh.c:187
>> #8  0x000055a797837007 in virshReconnect (ctl=ctl@entry=0x7ffc551dd980, name=name@entry=0x0, readonly=<optimized out>, readonly@entry=false, force=force@entry=false) at ../tools/virsh.c:223
>> #9  0x000055a7978371e0 in virshConnectionHandler (ctl=0x7ffc551dd980) at ../tools/virsh.c:325
>> #10 0x000055a797880172 in vshCommandRun (ctl=ctl@entry=0x7ffc551dd980, cmd=0x55a79865f580) at ../tools/vsh.c:1308
>> #11 0x000055a7978367b7 in main (argc=2, argv=<optimized out>) at ../tools/virsh.c:907
>>
>> The odd thing is that I attached gdb to this virsh process several minutes after
>> invoking the test tool that calls 'virsh list'. I can't explain why the process
>> is still blocked in g_usleep, which should only have slept for 10 milliseconds.
>> Even odder, detaching from the process appears to awaken g_usleep and allow
>> process shutdown to continue. The oddness can also be seen in the debug output:
>>
>> 2021-12-12 16:35:38.783+0000: 5912: debug : virCommandRunAsync:2629 : About to run /usr/bin/pkttyagent --process 5912 --notify-fd 4 --fallback
>> 2021-12-12 16:35:38.787+0000: 5912: debug : virCommandRunAsync:2632 : Command result 0, with PID 5914
>> ...
>> 2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:177 : aborting child process 5914
>> 2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:185 : trying SIGTERM to child process 5914
>>
>> Attach gdb to the process, observe the backtrace above, quit gdb.
>>
>> 2021-12-12 16:44:18.059+0000: 5912: debug : virProcessAbort:195 : trying SIGKILL to child process 5914
>> 2021-12-12 16:44:18.061+0000: 5912: debug : virProcessAbort:201 : process has ended: fatal signal 9
>>
> 
> The stuck g_usleep() is weird.  Isn't there a tremendous load on the
> machine?  I can't think of much else.

The machine is mostly idle.
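For reference, the kill sequence visible in the backtrace and debug log (SIGTERM, a short sleep, then SIGKILL) can be sketched roughly as below. This is a hypothetical helper for illustration, not libvirt's actual virProcessAbort(): the name abort_child(), the grace_ms parameter, and the use of nanosleep() in place of g_usleep() are all assumptions. It only shows the intended control flow; a 10 ms sleep that never returns, as in the backtrace, is not something this logic explains by itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not libvirt's virProcessAbort(): ask the child to
 * exit with SIGTERM, give it a short grace period, and fall back to
 * SIGKILL if it is still running. Returns true once the child is reaped. */
static bool
abort_child(pid_t pid, long grace_ms)
{
    int status;
    pid_t ret;
    struct timespec grace = {
        .tv_sec = grace_ms / 1000,
        .tv_nsec = (grace_ms % 1000) * 1000000L,
    };

    if (pid <= 0)
        return false;

    kill(pid, SIGTERM);
    /* Stand-in for the single g_usleep(10 ms) seen in the backtrace. */
    nanosleep(&grace, NULL);

    /* Non-blocking check: did the child exit on SIGTERM? */
    while ((ret = waitpid(pid, &status, WNOHANG)) == -1 && errno == EINTR)
        ;
    if (ret == pid)
        return true;
    if (ret == -1)
        return false;

    /* Still running (ret == 0): force it and block until it is reaped. */
    kill(pid, SIGKILL);
    while ((ret = waitpid(pid, &status, 0)) == -1 && errno == EINTR)
        ;
    return ret == pid;
}
```

A child that ignores SIGTERM is still reaped through the SIGKILL path once the grace period expires.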

> Also, I am looking at the pkttyagent code and it looks like it blocks the
> first SIGTERM, so sending two of them should help in that case, but if
> we want to wait for a few ms between them, then we'll be in the same
> pickle.
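Assuming "blocks" here means pkttyagent masks SIGTERM (an assumption; not checked against its source), two back-to-back kill() calls may not help on their own: on Linux, standard signals are not queued, so a second SIGTERM sent while the first is still pending is merged with it. The hypothetical demo count_double_sigterm() below has a child mask SIGTERM, receive two of them, then unmask and report how many deliveries it saw; on Linux it reports 1. That is why some delay between the two signals would be needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Incremented by the child's SIGTERM handler. */
static volatile sig_atomic_t term_count;

static void
on_term(int sig)
{
    (void)sig;
    term_count++;
}

/* Hypothetical demo, not pkttyagent or libvirt code: fork a child that
 * masks SIGTERM, send it two SIGTERMs back to back, then let the child
 * unmask the signal and exit with the number of deliveries it observed.
 * Returns that number, or -1 on setup failure. */
static int
count_double_sigterm(void)
{
    int ready[2], go[2], status;
    char c = 0;
    pid_t pid;

    if (pipe(ready) < 0 || pipe(go) < 0)
        return -1;

    if ((pid = fork()) < 0)
        return -1;

    if (pid == 0) {
        struct sigaction sa;
        sigset_t set;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_term;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGTERM, &sa, NULL);

        sigemptyset(&set);
        sigaddset(&set, SIGTERM);
        sigprocmask(SIG_BLOCK, &set, NULL);    /* SIGTERM now stays pending */

        if (write(ready[1], &c, 1) != 1 ||     /* tell the parent we're ready */
            read(go[0], &c, 1) != 1)           /* wait until both are sent */
            _exit(100);

        sigprocmask(SIG_UNBLOCK, &set, NULL);  /* pending SIGTERM delivered */
        _exit(term_count);
    }

    if (read(ready[0], &c, 1) != 1)
        return -1;
    kill(pid, SIGTERM);
    kill(pid, SIGTERM);                        /* merges with the pending one */
    if (write(go[1], &c, 1) != 1)
        return -1;

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```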
> 
>>
>>> Anyway, if just adding:
>>>
>>>      if (!isatty(STDIN_FILENO))
>>>          return false;
>>
>> This indeed fixes the regression in the test tool.
>>
> 
> That just means that it won't start the agent.  Let's do this, but I
> would really, *really* like to figure out what the heck is happening
> there, because there has to be something wrong and it might just be
> waiting around the corner to bite us in a year or so.  Although I
> understand how improbable that is.
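An availability check along the lines of the isatty() snippet above might also confirm that the agent binary is executable before trying to spawn it. The sketch below is hypothetical: the function name polkit_tty_agent_available() and its parameters are made up for illustration and are not libvirt's API, and the /usr/bin/pkttyagent path is only taken from the debug log.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <unistd.h>

/* Path taken from the debug log; hard-coding it is an assumption. */
#define PKTTYAGENT_PATH "/usr/bin/pkttyagent"

/* Hypothetical availability check: only try to spawn the text agent when
 * there is a terminal for it to prompt on and the binary is executable.
 * The fd and path are parameters so the check is easy to exercise. */
static bool
polkit_tty_agent_available(int fd, const char *agent_path)
{
    /* pkttyagent prompts on the terminal; without one it cannot help. */
    if (!isatty(fd))
        return false;

    /* Skip spawning entirely if the agent is missing or not executable. */
    if (access(agent_path, X_OK) < 0)
        return false;

    return true;
}
```

A caller would pass STDIN_FILENO and PKTTYAGENT_PATH; taking the fd as a parameter just lets the check be tested with a pipe, which is never a tty.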

Do you have additional suggestions that may help us gain a better understanding 
of the problem?

Regards,
Jim





More information about the libvir-list mailing list