[Crash-utility] [PATCH] fix the accounting of RSS when SPLIT_RSS_ACCOUNTING is enabled

Mon Oct 28 20:06:19 UTC 2013

----- Original Message -----
> Hi Dave,
> 
> It was noticed that when SPLIT_RSS_ACCOUNTING is enabled in kernel, the RSS
> values shown by ps command is huge (negative value represented as ulong)
> for certain tasks. Shown below is a ps output of such kind.
> 
>    1416      1   3  db506ac0  IN  36235.9     936 4194128  debuggerd
>    1417      1   3  db500040  IN   0.0   16512     36  rild
>    1418      1   1  db500580  IN   0.1   47160    488  surfaceflinger
>    1419      1   0  db74d040  IN   1.1  458544   5148  zygote
> *   1420      1   1  db5d3ac0  IN  36235.9   13868 4194280  drmserver*
>    1421      1   1  db5d3580  IN   0.0   54944     20  mediaserver
>    1422      1   0  db7c9580  IN   0.0     900     20  installd
>  *  1423      1   0  db7c9040  IN  36235.8    1828 4194112  keystore
>    1424      1   3  db5d8580  IN  36235.9    1152 4194276  akmdfs
>    1426      1   3  db7bd040  IN  36235.9   10880 4194212  glgps
>    1428      1   0  db61c040  IN  36235.9    2084 4194152  usb_portd
>    1429      1   0  db753ac0  IN  36235.9     868 4194148  atxd_proxy
>    1431      1   1  da6fbac0  UN  36235.9    6100 4194280  bkmgrd*
> *
> *
> When SPLIT_RSS_ACCOUNTING is enabled, the rss values are stored in
> task_struct of each thread, and is synced to the mm_struct.rss_stat only at
> certain events. So during a crash it is likely that mm.rss_stat contains
> old values which can even be negative.
> I have prepared a patch (attached) to sync the task_struct.rss_stat with
> mm.rss_stat, in get_task_mem_usage, when SPLIT_RSS_ACCOUNTING is enabled.
> 
> Please share your thoughts on this.

Hi Vinayak,

Good catch -- it certainly should be fixed.  The devil is in the details of how
best to accomplish it...

On the face of it, it would seem that the task group list gathering via do_list()
would be the best way.  The problem is that it could potentially cause the "ps"
command to fail on live systems if the list becomes invalid, or is invalid with
respect to the pre-gathered set of tasks that get captured prior to execution
of the command.

Prior to each crash command whose command_table_entry structure contains the
REFRESH_TASK_TABLE flag (as does "ps"), the tt->refresh_task_table() function
is called to scan the kernel's pid_hash[] for a stable set of tasks.  And only
those tasks are accessed or used by that crash command instance.   

By calling do_list(), it's conceivable (although fairly unlikely) that the command
could go either go off into the weeds following a list that's being modified, or 
if the list has changed since the "stable" list was gathered prior to the command.
So preferably, it would be best if the new functionality could be constrained to 
the stable set of pre-gathered tasks -- again if for no other reason than the fact
that the crash utility only uses the pre-gathered task-set whenever it parses tasks.  

There's also another potential problem in having get_task_mem_usage() use 
open_tmpfile().  Since it's a function that gets called by other functions
or macros (IN_TASK_VMA() for example), it's possible that the caller may
also be operating inside an open_tmpfile() area, and open_tmpfile() is not
recursive.  In those internal cases, open_tmpfile2() could be used, but again,
it's probably best to avoid the do_list() call.

So I tinkered with an optional manner of checking the thread group list, i.e.,
like this:

	if (VALID_MEMBER(task_struct_rss_stat)) {
	        int i, cnt;
		int sync_rss;
	        ulong tgid;
	        struct task_context *tc1;

	        tgid = task_tgid(task);

	        tc1 = FIRST_CONTEXT();
	        for (i = cnt = 0; i < RUNNING_TASKS(); i++, tc1++) {
	                if (task_tgid(tc1->task) != tgid)
				continue;

			/* count 0 -> filepages */
			if (!readmem(tc1->task +
				OFFSET(task_struct_rss_stat) +
				OFFSET(task_rss_stat_count), KVADDR,
				&sync_rss,
				sizeof(int), "task_struct rss_stat 0",
				RETURN_ON_ERROR))
					continue;

			rss += sync_rss;

			/* count 1 -> anonpages */
			if (!readmem(tc1->task +
				OFFSET(task_struct_rss_stat) +
				OFFSET(task_rss_stat_count) + sizeof(int),
				KVADDR, &sync_rss,
				sizeof(int), "task_struct rss_stat 0",
				RETURN_ON_ERROR))
					continue;

			rss += sync_rss;
		}
	}

It's inefficient in comparison to using do_list(), but it does guarantee
safety on live systems.

A couple other minor issues --  the two readmem() type strings above both
indicate "rss_stat 0", so they should be differentiated with strings indicating
"task_struct rss_stat MM_FILEPAGES" and "task_struct rss_stat MM_ANONPAGES".

And lastly, since MEMBER_EXISTS() and MEMBER_OFFSET_INIT() do essentially
the same thing, it simpler in this case to just call MEMBER_OFFSET_INIT()
the 3 times without any MEMBER_EXISTS() conditionalization:

+	if (MEMBER_EXISTS("task_struct", "thread_group"))
+		MEMBER_OFFSET_INIT(task_struct_thread_group, "task_struct",
+			"thread_group");
+
+	if (MEMBER_EXISTS("task_struct", "rss_stat")) {
+		MEMBER_OFFSET_INIT(task_struct_rss_stat, "task_struct",
+			"rss_stat");
+		MEMBER_OFFSET_INIT(task_rss_stat_count, "task_rss_stat",
+			"count");
+	}
+

If the members don't exist, they will just get set to INVALID_OFFSET (-1).

What do you think?

Dave