[vfio-users] lspci and vfio_pci_release deadlock when destroying a PCI passthrough VM
Wuzongyong (Euler Dept)
cordius.wu at huawei.com
Wed Mar 20 13:32:33 UTC 2019
Hi Alex,
I noticed the patch you posted at https://lkml.org/lkml/2019/2/18/1315
In it you said that a previous commit of yours may be prone to deadlock. Could you please share details on how to reproduce the deadlock, if you know them?
I hit a similar problem: all lspci commands went into D state and libvirtd went into Z state when I destroyed a VM with a GPU passed through. The hung-task stack traces look like this:
2019-03-20T13:37:14.726514+07:00|err|kernel[-]|[2427373.553663] INFO: task ps:112058 blocked for more than 120 seconds.
2019-03-20T13:37:14.726576+07:00|err|kernel[-]|[2427373.553667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-03-20T13:37:14.726599+07:00|info|kernel[-]|[2427373.553669] ps D 0000000000000000 0 112058 1 0x00000004
2019-03-20T13:37:14.726620+07:00|warning|kernel[-]|[2427373.553673] Call Trace:
2019-03-20T13:37:14.726640+07:00|warning|kernel[-]|[2427373.553682] [<ffffffff816b7069>] schedule_preempt_disabled+0x29/0x70
2019-03-20T13:37:14.726668+07:00|warning|kernel[-]|[2427373.553684] [<ffffffff816b4a21>] __mutex_lock_slowpath+0xe1/0x170
2019-03-20T13:37:14.726689+07:00|warning|kernel[-]|[2427373.553689] [<ffffffff816b400f>] mutex_lock+0x1f/0x2f
2019-03-20T13:37:14.726707+07:00|warning|kernel[-]|[2427373.553695] [<ffffffff81379337>] pci_bus_save_and_disable+0x37/0x70
2019-03-20T13:37:14.726725+07:00|warning|kernel[-]|[2427373.553697] [<ffffffff8137aeb8>] pci_try_reset_bus+0x38/0x80
2019-03-20T13:37:14.726743+07:00|warning|kernel[-]|[2427373.553730] [<ffffffffa0261045>] vfio_pci_release+0x3d5/0x430 [vfio_pci]
2019-03-20T13:37:14.726761+07:00|warning|kernel[-]|[2427373.553737] [<ffffffffa0260640>] ? vfio_pci_rw+0xc0/0xc0 [vfio_pci]
2019-03-20T13:37:14.726779+07:00|warning|kernel[-]|[2427373.553745] [<ffffffffa02529f2>] vfio_device_fops_release+0x22/0x40 [vfio]
2019-03-20T13:37:14.726798+07:00|warning|kernel[-]|[2427373.553751] [<ffffffff812179dc>] __fput+0xec/0x260
2019-03-20T13:37:14.726821+07:00|warning|kernel[-]|[2427373.553754] [<ffffffff81217c8e>] ____fput+0xe/0x10
2019-03-20T13:37:14.726840+07:00|warning|kernel[-]|[2427373.553758] [<ffffffff810b684a>] task_work_run+0xaa/0xe0
2019-03-20T13:37:14.726858+07:00|warning|kernel[-]|[2427373.553763] [<ffffffff8102ac12>] do_notify_resume+0x92/0xb0
2019-03-20T13:37:14.726876+07:00|warning|kernel[-]|[2427373.553767] [<ffffffff816c264f>] int_signal+0x12/0x17
2019-03-20T13:37:14.726892+07:00|err|kernel[-]|[2427373.553771] INFO: task lspci:139540 blocked for more than 120 seconds.
2019-03-20T13:37:14.726910+07:00|err|kernel[-]|[2427373.553772] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-03-20T13:37:14.726929+07:00|info|kernel[-]|[2427373.553773] lspci D 0000000000000000 0 139540 139539 0x00000000
2019-03-20T13:37:14.726948+07:00|warning|kernel[-]|[2427373.553776] Call Trace:
2019-03-20T13:37:14.726970+07:00|warning|kernel[-]|[2427373.553778] [<ffffffff816b5f79>] schedule+0x29/0x70
2019-03-20T13:37:14.726989+07:00|warning|kernel[-]|[2427373.553782] [<ffffffff81370ca0>] pci_wait_cfg+0xa0/0x110
2019-03-20T13:37:14.727006+07:00|warning|kernel[-]|[2427373.553787] [<ffffffff810cfe40>] ? wake_up_state+0x20/0x20
2019-03-20T13:37:14.727023+07:00|warning|kernel[-]|[2427373.553790] [<ffffffff81370e15>] pci_user_read_config_dword+0x105/0x110
2019-03-20T13:37:14.727043+07:00|warning|kernel[-]|[2427373.553794] [<ffffffff8137e974>] pci_read_config+0x114/0x2c0
2019-03-20T13:37:14.727063+07:00|warning|kernel[-]|[2427373.553799] [<ffffffff811f4835>] ? __kmalloc+0x55/0x240
2019-03-20T13:37:14.727084+07:00|warning|kernel[-]|[2427373.553804] [<ffffffff812992fe>] read+0xde/0x1f0
2019-03-20T13:37:14.727103+07:00|warning|kernel[-]|[2427373.553807] [<ffffffff81215a5f>] vfs_read+0x9f/0x170
2019-03-20T13:37:14.727123+07:00|warning|kernel[-]|[2427373.553809] [<ffffffff81216812>] SyS_pread64+0x92/0xc0
2019-03-20T13:37:14.727141+07:00|warning|kernel[-]|[2427373.553812] [<ffffffff816c22ef>] system_call_fastpath+0x1c/0x21
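It seems that lspci and vfio_pci_release are deadlocked: task ps:112058 is blocked in mutex_lock() inside pci_bus_save_and_disable(), reached from vfio_pci_release() via pci_try_reset_bus(), while lspci:139540 is blocked in pci_wait_cfg(), waiting for config-space access to a device to be unblocked.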
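For anyone trying to reproduce this, here is a minimal Python sketch (a hypothetical helper, not part of vfio, libvirt or pciutils) that lists threads currently in D (uninterruptible sleep) or Z (zombie) state by reading /proc/<pid>/task/<tid>/stat, and prints the kernel stack of D-state threads from /proc/<pid>/task/<tid>/stack. It assumes a Linux /proc; reading the stack file normally requires root and a kernel with stack tracing enabled.

```python
import os
from typing import Iterator, Tuple


def parse_stat(stat_line: str) -> Tuple[int, str, str]:
    """Parse one /proc/.../stat line into (pid, comm, state).

    comm sits inside parentheses and may itself contain spaces or ')',
    so use the first '(' and the *last* ')' instead of splitting on spaces.
    """
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    pid = int(stat_line[:lparen])
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]  # single state letter, e.g. R, S, D, Z
    return pid, comm, state


def _read(path: str) -> str:
    with open(path) as f:
        return f.read()


def stuck_tasks(states: str = "DZ") -> Iterator[Tuple[int, int, str, str]]:
    """Yield (pid, tid, comm, state) for every thread whose state is in `states`."""
    if not os.path.isdir("/proc"):
        return  # not Linux, or /proc is not mounted
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue  # skip non-process entries such as /proc/sys
        try:
            tids = os.listdir(f"/proc/{pid}/task")
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                _, comm, state = parse_stat(_read(f"/proc/{pid}/task/{tid}/stat"))
            except (OSError, ValueError, IndexError):
                continue  # thread vanished or the line could not be parsed
            if state in states:
                yield int(pid), int(tid), comm, state


def kernel_stack(pid: int, tid: int) -> str:
    """Return the kernel stack of one thread, or a note if it cannot be read."""
    try:
        return _read(f"/proc/{pid}/task/{tid}/stack")
    except OSError as exc:
        return f"<stack unavailable: {exc}>"


if __name__ == "__main__":
    for pid, tid, comm, state in stuck_tasks():
        print(f"pid={pid} tid={tid} comm={comm} state={state}")
        if state == "D":
            print(kernel_stack(pid, tid))
```

Run as root while the VM is being destroyed; if lspci and the threads of the VM process show up in D state with stacks like the ones above, that would match this hang.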
Thanks,
Zongyong Wu