[Crash-utility] [PATCH] Fix bugs in runq
Zhang Yanfei
zhangyanfei at cn.fujitsu.com
Sat Aug 25 03:23:32 UTC 2012
On 2012-08-25 02:17, Dave Anderson wrote:
>
>
> ----- Original Message -----
>>
>>
>> ----- Original Message -----
>>> Hello Dave,
>>>
>>> In runq command, when dumping cfs and rt runqueues,
>>> it seems that we get the wrong nr_running values of rq
>>> and cfs_rq.
>>>
>>> Please refer to the attached patch.
>>>
>>> Thanks
>>> Zhang Yanfei
>>
>> Hello Zhang,
>>
>> I understand what you are trying to accomplish with this patch, but
>> none of my test dumpfiles can actually verify it because there is no
>> difference with or without your patch. What failure mode did you see
>> in your testing? I presume that it just showed "[no tasks queued]"
>> for the RT runqueue when there were actually tasks queued there?
>>
>> The reason I ask is that I'm thinking that a better solution would
>> be to simplify dump_CFS_runqueues() by *not* accessing and using
>> rq_nr_running, cfs_rq_nr_running or cfs_rq_h_nr_running.
>>
>> Those counters are only read to determine the "active" argument to
>> pass to dump_RT_prio_array(), which returns immediately if it is
>> FALSE. However, if we get rid of the "active" argument and simply
>> allow dump_RT_prio_array() to always check its queues every time,
>> it still works just fine.
>>
>> For example, I tested my set of sample dumpfiles with this patch:
>>
>> diff -u -r1.205 task.c
>> --- task.c 12 Jul 2012 20:04:00 -0000 1.205
>> +++ task.c 22 Aug 2012 15:33:32 -0000
>> @@ -7636,7 +7636,7 @@
>> OFFSET(cfs_rq_tasks_timeline));
>> }
>>
>> - dump_RT_prio_array(nr_running != cfs_rq_nr_running,
>> + dump_RT_prio_array(TRUE,
>> runq + OFFSET(rq_rt) + OFFSET(rt_rq_active),
>> &runqbuf[OFFSET(rq_rt) +
>> OFFSET(rt_rq_active)]);
>>
>> and the output is identical to testing with, and without, your patch.
>>
>> So the question is whether dump_CFS_runqueues() should be needlessly
>> complicated with all of the "nr_running" references?
>>
>> In fact, it also seems possible that a crash could happen at a point in
>> the scheduler code where those counters are not
>> valid/current/trustworthy.
>>
>> So unless you can convince me otherwise, I'd prefer to just remove
>> the "nr_running" business completely.
>
> Hello Zhang,
>
> Here's the patch I've got queued, which resolves the bug you encountered
> by simplifying things:
>
OK, I see.
Based on this patch, I made a new patch that fixes a problem with
dumping the RT runqueues: dump_RT_prio_array() currently does not
support the RT group scheduler.
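To illustrate what group support involves, here is a minimal sketch. It is not the attached patch and it does not use crash's own helpers. It assumes only that, with CONFIG_RT_GROUP_SCHED, an entry in an RT priority array can be either a task or a group entity whose my_q field points to a child RT runqueue. The struct names (rt_entity_model, rt_rq_model) and the function dump_rt_array() are hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_RT_PRIO 100   /* same value as the kernel constant */

struct rt_rq_model;

/* Hypothetical, simplified stand-in for the kernel's struct sched_rt_entity. */
struct rt_entity_model {
    const char *comm;               /* task name; NULL for a group entity */
    struct rt_rq_model *my_q;       /* child runqueue of a group entity, else NULL */
    struct rt_entity_model *next;   /* next entity queued at the same priority */
};

/* Hypothetical, simplified stand-in for an RT runqueue's priority array. */
struct rt_rq_model {
    struct rt_entity_model *queue[MAX_RT_PRIO];
};

/*
 * Walk one priority array. A task entity is printed directly; a group
 * entity is followed into its child array. Returns the number of tasks
 * printed, counting those found in child arrays.
 */
static int dump_rt_array(const struct rt_rq_model *rq, int depth)
{
    int count = 0;

    for (int prio = 0; prio < MAX_RT_PRIO; prio++) {
        const struct rt_entity_model *e;

        for (e = rq->queue[prio]; e != NULL; e = e->next) {
            if (e->my_q != NULL) {
                printf("%*s[%3d] CHILD RT PRIO_ARRAY: %p\n",
                       depth * 3, "", prio, (void *)e->my_q);
                count += dump_rt_array(e->my_q, depth + 1);
            } else {
                assert(e->comm != NULL);   /* a task entity must have a name */
                printf("%*s[%3d] COMMAND: \"%s\"\n",
                       depth * 3, "", prio, e->comm);
                count++;
            }
        }
    }
    return count;
}
```

A walker that skips the my_q branch prints the priority index for a group entity but none of the tasks queued beneath it.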
In my test, I put some RT tasks into one group, as follows:
mkdir /cgroup/cpu/test1
echo 850000 > /cgroup/cpu/test1/cpu.rt_runtime_us
./rtloop1 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop1 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop1 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop98 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop45 &
echo $! > /cgroup/cpu/test1/tasks
./rtloop99 &
echo $! > /cgroup/cpu/test1/tasks
Analysing the resulting vmcore with crash gives:
crash> runq
CPU 0 RUNQUEUE: ffff880028216680
CURRENT: PID: 5125 TASK: ffff88010799d540 COMMAND: "sh"
RT PRIO_ARRAY: ffff880028216808
[ 0] PID: 5136 TASK: ffff8801153cc040 COMMAND: "rtloop99"
PID: 6 TASK: ffff88013d7c6080 COMMAND: "watchdog/0"
PID: 3 TASK: ffff88013d7ba040 COMMAND: "migration/0"
[ 1] PID: 5134 TASK: ffff8801153cd500 COMMAND: "rtloop98"
PID: 5135 TASK: ffff8801153ccaa0 COMMAND: "rtloop98"
CFS RB_ROOT: ffff880028216718
[120] PID: 5109 TASK: ffff880037923500 COMMAND: "sh"
[120] PID: 5107 TASK: ffff88006eeccaa0 COMMAND: "sh"
[120] PID: 5123 TASK: ffff880107a4caa0 COMMAND: "sh"
CPU 1 RUNQUEUE: ffff880028296680
CURRENT: PID: 5086 TASK: ffff88006eecc040 COMMAND: "bash"
RT PRIO_ARRAY: ffff880028296808
[ 0] PID: 5137 TASK: ffff880107b35540 COMMAND: "rtloop99"
PID: 10 TASK: ffff88013cc2cae0 COMMAND: "watchdog/1"
PID: 2852 TASK: ffff88013bd5aae0 COMMAND: "rtkit-daemon"
[ 54] CFS RB_ROOT: ffff880028296718
[120] PID: 5115 TASK: ffff8801152b1500 COMMAND: "sh"
[120] PID: 5113 TASK: ffff880139530080 COMMAND: "sh"
[120] PID: 5111 TASK: ffff88011bd86080 COMMAND: "sh"
[120] PID: 5121 TASK: ffff880115a9e080 COMMAND: "sh"
[120] PID: 5117 TASK: ffff8801152b0040 COMMAND: "sh"
[120] PID: 5119 TASK: ffff880115a9eae0 COMMAND: "sh"
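The output is incorrect: the tasks queued on the group's child RT
runqueues (for example the rtloop1 and rtloop45 tasks) are missing, and
on CPU 1 the priority index [ 54] is printed with no task after it.
After applying the attached patch, crash displays them correctly: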
crash> runq
CPU 0 RUNQUEUE: ffff880028216680
CURRENT: PID: 5125 TASK: ffff88010799d540 COMMAND: "sh"
RT PRIO_ARRAY: ffff880028216808
[ 0] PID: 5136 TASK: ffff8801153cc040 COMMAND: "rtloop99"
CHILD RT PRIO_ARRAY: ffff88013b050000
[ 0] PID: 5133 TASK: ffff88010799c080 COMMAND: "rtloop99"
[ 1] PID: 5131 TASK: ffff880037922aa0 COMMAND: "rtloop98"
[ 98] PID: 5128 TASK: ffff88011bd87540 COMMAND: "rtloop1"
PID: 5130 TASK: ffff8801396e7500 COMMAND: "rtloop1"
PID: 5129 TASK: ffff88011bf5a080 COMMAND: "rtloop1"
PID: 6 TASK: ffff88013d7c6080 COMMAND: "watchdog/0"
PID: 3 TASK: ffff88013d7ba040 COMMAND: "migration/0"
[ 1] PID: 5134 TASK: ffff8801153cd500 COMMAND: "rtloop98"
PID: 5135 TASK: ffff8801153ccaa0 COMMAND: "rtloop98"
CFS RB_ROOT: ffff880028216718
[120] PID: 5109 TASK: ffff880037923500 COMMAND: "sh"
[120] PID: 5107 TASK: ffff88006eeccaa0 COMMAND: "sh"
[120] PID: 5123 TASK: ffff880107a4caa0 COMMAND: "sh"
CPU 1 RUNQUEUE: ffff880028296680
CURRENT: PID: 5086 TASK: ffff88006eecc040 COMMAND: "bash"
RT PRIO_ARRAY: ffff880028296808
[ 0] PID: 5137 TASK: ffff880107b35540 COMMAND: "rtloop99"
PID: 10 TASK: ffff88013cc2cae0 COMMAND: "watchdog/1"
PID: 2852 TASK: ffff88013bd5aae0 COMMAND: "rtkit-daemon"
[ 54] CHILD RT PRIO_ARRAY: ffff880138978000
[ 54] PID: 5132 TASK: ffff88006eecd500 COMMAND: "rtloop45"
CFS RB_ROOT: ffff880028296718
[120] PID: 5115 TASK: ffff8801152b1500 COMMAND: "sh"
[120] PID: 5113 TASK: ffff880139530080 COMMAND: "sh"
[120] PID: 5111 TASK: ffff88011bd86080 COMMAND: "sh"
[120] PID: 5121 TASK: ffff880115a9e080 COMMAND: "sh"
[120] PID: 5117 TASK: ffff8801152b0040 COMMAND: "sh"
[120] PID: 5119 TASK: ffff880115a9eae0 COMMAND: "sh"
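The bracketed index in the RT part of the output is the kernel-internal priority, where a lower value means a higher priority. Assuming each rtloopNN test program runs at real-time priority NN (the programs themselves are not shown here), the mapping to the index can be sketched as below. The function name rt_kernel_prio() is hypothetical.

```c
#include <assert.h>

#define MAX_RT_PRIO 100   /* same value as the kernel constant */

/*
 * Convert a user-visible real-time priority (1..99) into the index used
 * in the RT priority array: index 0 is the highest priority.
 */
static int rt_kernel_prio(int rt_priority)
{
    assert(rt_priority >= 1 && rt_priority <= MAX_RT_PRIO - 1);
    return MAX_RT_PRIO - 1 - rt_priority;
}
```

Under that assumption rtloop99 lands at [ 0], rtloop98 at [ 1], rtloop45 at [ 54] and rtloop1 at [ 98], which matches the output above.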
Is this output format for the RT runqueues OK, or do you have any suggestions?
Thanks
Zhang Yanfei
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0001-Fix-rt-not-support-group-sched-bug.patch
URL: <http://listman.redhat.com/archives/crash-utility/attachments/20120825/dee76333/attachment.ksh>