rpms/kernel/F-11 sched-deal-with-low-load-in-wake-affine.patch, NONE, 1.1 sched-disable-NEW-FAIR-SLEEPERS-for-now.patch, NONE, 1.1 sched-ensure-child-cant-gain-time-over-its-parent-after-fork.patch, NONE, 1.1 sched-remove-shortcut-from-select-task-rq-fair.patch, NONE, 1.1 sched-retune-scheduler-latency-defaults.patch, NONE, 1.1 kernel.spec, 1.1745, 1.1746

Chuck Ebbert cebbert at fedoraproject.org
Sat Sep 26 20:52:13 UTC 2009


Author: cebbert

Update of /cvs/pkgs/rpms/kernel/F-11
In directory cvs1.fedora.phx.redhat.com:/tmp/cvs-serv4426

Modified Files:
	kernel.spec 
Added Files:
	sched-deal-with-low-load-in-wake-affine.patch 
	sched-disable-NEW-FAIR-SLEEPERS-for-now.patch 
	sched-ensure-child-cant-gain-time-over-its-parent-after-fork.patch 
	sched-remove-shortcut-from-select-task-rq-fair.patch 
	sched-retune-scheduler-latency-defaults.patch 
Log Message:
Scheduler fixes cherry-picked from 2.6.32

sched-deal-with-low-load-in-wake-affine.patch:
 sched_fair.c |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

--- NEW FILE sched-deal-with-low-load-in-wake-affine.patch ---
From: Peter Zijlstra <a.p.zijlstra at chello.nl>
Date: Mon, 7 Sep 2009 16:28:05 +0000 (+0200)
Subject: sched: Deal with low-load in wake_affine()
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=71a29aa7b600595d0ef373ea605ac656876d1f2f

sched: Deal with low-load in wake_affine()

wake_affine() would always fail under low-load situations where
both prev and this were idle, because adding a single task will
always be a significant imbalance, even if there's nothing
around that could balance it.

Deal with this by allowing imbalance when there's nothing you
can do about it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra at chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo at elte.hu>
---

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index d7fda41..cc97ea4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1262,7 +1262,17 @@ wake_affine(struct sched_domain *this_sd, struct rq *this_rq,
 	tg = task_group(p);
 	weight = p->se.load.weight;
 
-	balanced = 100*(tl + effective_load(tg, this_cpu, weight, weight)) <=
+	/*
+	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
+	 * due to the sync cause above having dropped tl to 0, we'll always have
+	 * an imbalance, but there's really nothing you can do about that, so
+	 * that's good too.
+	 *
+	 * Otherwise check if either cpus are near enough in load to allow this
+	 * task to be woken on this_cpu.
+	 */
+	balanced = !tl ||
+		100*(tl + effective_load(tg, this_cpu, weight, weight)) <=
 		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
 
 	/*

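For readers following along outside the diff context, the patched balance test can be boiled down to a standalone sketch. This is not the kernel code: wake_affine_balanced() and its parameters are illustrative stand-ins, and the group-weighted effective_load() terms from sched_fair.c are collapsed into plain additions.

```c
#include <assert.h>

/*
 * Illustrative stand-in for the patched wake_affine() balance test.
 *   tl        - this_cpu's load (possibly 0 after the sync discount)
 *   load      - prev_cpu's load
 *   weight    - the waking task's load weight
 *   imbalance - allowed imbalance in percent (e.g. 125 for 25% slack)
 */
int wake_affine_balanced(long tl, long load, long weight, long imbalance)
{
	/* !tl: both cpus idle, nothing could balance it -> allow the wakeup */
	return !tl || 100 * (tl + weight) <= imbalance * load;
}
```

With tl == 0 the single waking task would always look like a 100% imbalance under the old test; the !tl short-circuit is what the patch adds.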
sched-disable-NEW-FAIR-SLEEPERS-for-now.patch:
 sched_features.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- NEW FILE sched-disable-NEW-FAIR-SLEEPERS-for-now.patch ---
From: Ingo Molnar <mingo at elte.hu>
Date: Thu, 10 Sep 2009 18:34:48 +0000 (+0200)
Subject: sched: Disable NEW_FAIR_SLEEPERS for now
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=3f2aa307c4d26b4ed6509d0a79e8254c9e07e921

sched: Disable NEW_FAIR_SLEEPERS for now

Nikos Chantziaras and Jens Axboe reported that turning off
NEW_FAIR_SLEEPERS improves desktop interactivity visibly.

Nikos described his experiences the following way:

  " With this setting, I can do "nice -n 19 make -j20" and
    still have a very smooth desktop and watch a movie at
    the same time.  Various other annoyances (like the
    "logout/shutdown/restart" dialog of KDE not appearing
    at all until the background fade-out effect has finished)
    are also gone.  So this seems to be the single most
    important setting that vastly improves desktop behavior,
    at least here. "

Jens described it the following way, referring to a 10-seconds
xmodmap scheduling delay he was trying to debug:

  " Then I tried switching NO_NEW_FAIR_SLEEPERS on, and then
    I get:

    Performance counter stats for 'xmodmap .xmodmap-carl':

         9.009137  task-clock-msecs         #      0.447 CPUs
               18  context-switches         #      0.002 M/sec
                1  CPU-migrations           #      0.000 M/sec
              315  page-faults              #      0.035 M/sec

    0.020167093  seconds time elapsed

    Woot! "

So disable it for now. In perf trace output I can see weird

delta timestamps:

  cc1-9943  [001]  2802.059479616: sched_stat_wait: task: as:9944 wait: 2801938766276 [ns]

That nsec field is not supposed to be that large. More digging
is needed - but let's turn it off while the real bug is found.

Reported-by: Nikos Chantziaras <realnc at arcor.de>
Tested-by: Nikos Chantziaras <realnc at arcor.de>
Reported-by: Jens Axboe <jens.axboe at oracle.com>
Tested-by: Jens Axboe <jens.axboe at oracle.com>
Acked-by: Peter Zijlstra <a.p.zijlstra at chello.nl>
Cc: Mike Galbraith <efault at gmx.de>
LKML-Reference: <4AA93D34.8040500 at arcor.de>
Signed-off-by: Ingo Molnar <mingo at elte.hu>
---

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 4569bfa..e2dc63a 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -1,4 +1,4 @@
-SCHED_FEAT(NEW_FAIR_SLEEPERS, 1)
+SCHED_FEAT(NEW_FAIR_SLEEPERS, 0)
 SCHED_FEAT(NORMALIZED_SLEEPER, 0)
 SCHED_FEAT(ADAPTIVE_GRAN, 1)
 SCHED_FEAT(WAKEUP_PREEMPT, 1)

sched-ensure-child-cant-gain-time-over-its-parent-after-fork.patch:
 sched_fair.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- NEW FILE sched-ensure-child-cant-gain-time-over-its-parent-after-fork.patch ---
From: Mike Galbraith <efault at gmx.de>
Date: Tue, 8 Sep 2009 09:12:28 +0000 (+0200)
Subject: sched: Ensure that a child can't gain time over its parent after fork()
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=b5d9d734a53e0204aab0089079cbde2a1285a38f

sched: Ensure that a child can't gain time over its parent after fork()

A fork/exec load is usually "pass the baton", so the child
should never be placed behind the parent.  With START_DEBIT we
make room for the new task, but with child_runs_first, that
room comes out of the _parent's_ hide. There's nothing to say
that the parent wasn't ahead of min_vruntime at fork() time,
which means that the "baton carrier", who is essentially the
parent in drag, can gain time and increase scheduling latencies
for waiters.

With NEW_FAIR_SLEEPERS + START_DEBIT + child_runs_first
enabled, we essentially pass the sleeper fairness off to the
child, which is fine, but if we don't base placement on the
parent's updated vruntime, we can end up compounding latency
woes if the child itself then does fork/exec.  The debit
incurred at fork doesn't hurt the parent who is then going to
sleep and maybe exit, but the child who acquires the error
harms all comers.

This improves latencies of make -j<n> kernel build workloads.

Reported-by: Jens Axboe <jens.axboe at oracle.com>
Signed-off-by: Mike Galbraith <efault at gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra at chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo at elte.hu>
---

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cc97ea4..e386e5d 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -728,11 +728,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 
 			vruntime -= thresh;
 		}
-
-		/* ensure we never gain time by being placed backwards. */
-		vruntime = max_vruntime(se->vruntime, vruntime);
 	}
 
+	/* ensure we never gain time by being placed backwards. */
+	vruntime = max_vruntime(se->vruntime, vruntime);
+
 	se->vruntime = vruntime;
 }
 
@@ -1756,6 +1756,8 @@ static void task_new_fair(struct rq *rq, struct task_struct *p)
 	sched_info_queued(p);
 
 	update_curr(cfs_rq);
+	if (curr)
+		se->vruntime = curr->vruntime;
 	place_entity(cfs_rq, se, 1);
 
 	/* 'curr' will be NULL if the child belongs to a different group */

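The clamp this patch moves relies on max_vruntime(), which compares wrap-safely because vruntime is a free-running 64-bit counter. A minimal standalone sketch follows; the typedefs and the clamp_placement() wrapper are illustrative, while max_vruntime() mirrors the sched_fair.c helper.

```c
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/* Wrap-safe maximum: the signed difference handles counter wrap-around. */
u64 max_vruntime(u64 max, u64 vruntime)
{
	s64 delta = (s64)(vruntime - max);

	return delta > 0 ? vruntime : max;
}

/*
 * Illustrative wrapper for the clamp that now runs for every entity:
 * whatever vruntime place_entity() computes, the entity is never moved
 * backwards past its own current vruntime (i.e. it never gains time).
 */
u64 clamp_placement(u64 se_vruntime, u64 placed)
{
	return max_vruntime(se_vruntime, placed);
}
```

Moving the clamp outside the sleeper-bonus branch is what guarantees a freshly forked child, seeded from curr->vruntime, cannot land ahead of its parent.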
sched-remove-shortcut-from-select-task-rq-fair.patch:
 sched_fair.c |    2 --
 1 file changed, 2 deletions(-)

--- NEW FILE sched-remove-shortcut-from-select-task-rq-fair.patch ---
From: Peter Zijlstra <a.p.zijlstra at chello.nl>
Date: Mon, 7 Sep 2009 16:12:06 +0000 (+0200)
Subject: sched: Remove short cut from select_task_rq_fair()
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=cdd2ab3de4301728b20efd6225681d3ff591a938

sched: Remove short cut from select_task_rq_fair()

select_task_rq_fair() incorrectly skips the wake_affine()
logic, remove this.

When prev_cpu == this_cpu, the code jumps straight to the
wake_idle() logic, this doesn't give the wake_affine() logic
the chance to pin the task to this cpu.

Signed-off-by: Peter Zijlstra <a.p.zijlstra at chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo at elte.hu>
---

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 2ff850f..d7fda41 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1305,8 +1305,6 @@ static int select_task_rq_fair(struct task_struct *p, int sync)
 	this_rq		= cpu_rq(this_cpu);
 	new_cpu		= prev_cpu;
 
-	if (prev_cpu == this_cpu)
-		goto out;
 	/*
 	 * 'this_sd' is the first domain that both
 	 * this_cpu and prev_cpu are present in:

sched-retune-scheduler-latency-defaults.patch:
 sched_fair.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- NEW FILE sched-retune-scheduler-latency-defaults.patch ---
From: Mike Galbraith <efault at gmx.de>
Date: Wed, 9 Sep 2009 13:41:37 +0000 (+0200)
Subject: sched: Re-tune the scheduler latency defaults to decrease worst-case latencies
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=172e082a9111ea504ee34cbba26284a5ebdc53a7

sched: Re-tune the scheduler latency defaults to decrease worst-case latencies

[ cebbert : modified to change the target from 5 to 10 ]

Reduce the latency target from 20 msecs to 5 msecs.

Why? Larger latencies increase spread, which is good for scaling,
but bad for worst case latency.

We still have the ilog(nr_cpus) rule to scale up on bigger
server boxes.

Signed-off-by: Mike Galbraith <efault at gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra at chello.nl>
LKML-Reference: <1252486344.28645.18.camel at marge.simson.net>
Signed-off-by: Ingo Molnar <mingo at elte.hu>
---

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index af325a3..26fadb4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -24,7 +24,7 @@
 
 /*
  * Targeted preemption latency for CPU-bound tasks:
- * (default: 20ms * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 10ms * (1 + ilog(ncpus)), units: nanoseconds)
  *
  * NOTE: this latency value is not the same as the concept of
  * 'timeslice length' - timeslices in CFS are of variable length
@@ -34,13 +34,13 @@
  * (to see the precise effective timeslice length of your workload,
  *  run vmstat and monitor the context-switches (cs) field)
  */
-unsigned int sysctl_sched_latency = 20000000ULL;
+unsigned int sysctl_sched_latency = 10000000ULL;
 
 /*
  * Minimal preemption granularity for CPU-bound tasks:
- * (default: 4 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 2 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_min_granularity = 4000000ULL;
+unsigned int sysctl_sched_min_granularity = 2000000ULL;
 
 /*
  * is kept at sysctl_sched_latency / sysctl_sched_min_granularity
@@ -63,13 +63,13 @@ unsigned int __read_mostly sysctl_sched_compat_yield;
 
 /*
  * SCHED_OTHER wake-up granularity.
- * (default: 5 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 2.5 msec * (1 + ilog(ncpus)), units: nanoseconds)
  *
  * This option delays the preemption effects of decoupled workloads
  * and reduces their over-scheduling. Synchronous workloads will still
  * have immediate wakeup/sleep latencies.
  */
-unsigned int sysctl_sched_wakeup_granularity = 5000000UL;
+unsigned int sysctl_sched_wakeup_granularity = 2500000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 

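The "* (1 + ilog(ncpus))" in the updated comments is a boot-time scaling factor, not part of the stored sysctl value. A standalone sketch of that rule using this patch's 10 ms base follows; ilog2_u32() is a plain reimplementation, not the kernel's ilog2() macro.

```c
/* Integer base-2 logarithm (floor); stand-in for the kernel's ilog2(). */
unsigned int ilog2_u32(unsigned int n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * Effective targeted preemption latency in nanoseconds for ncpus,
 * per the comment's "10ms * (1 + ilog(ncpus))" rule.
 */
unsigned long long effective_latency_ns(unsigned int ncpus)
{
	const unsigned long long base_ns = 10000000ULL; /* 10 ms */

	return base_ns * (1 + ilog2_u32(ncpus));
}
```

So a uniprocessor box targets 10 ms, while larger machines still scale up: the patch halves the worst-case desktop latency without flattening the server-side scaling rule.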

Index: kernel.spec
===================================================================
RCS file: /cvs/pkgs/rpms/kernel/F-11/kernel.spec,v
retrieving revision 1.1745
retrieving revision 1.1746
diff -u -p -r1.1745 -r1.1746
--- kernel.spec	26 Sep 2009 18:10:15 -0000	1.1745
+++ kernel.spec	26 Sep 2009 20:52:12 -0000	1.1746
@@ -726,6 +726,14 @@ Patch13000: linux-2.6-kvm-skip-pit-check
 Patch13001: linux-2.6.29-xen-disable-gbpages.patch
 Patch13002: linux-2.6-virtio_blk-dont-bounce-highmem-requests.patch
 
+# sched fixes cherry-picked from 2.6.32
+Patch13100: sched-deal-with-low-load-in-wake-affine.patch
+Patch13101: sched-disable-NEW-FAIR-SLEEPERS-for-now.patch
+Patch13102: sched-ensure-child-cant-gain-time-over-its-parent-after-fork.patch
+Patch13103: sched-remove-shortcut-from-select-task-rq-fair.patch
+# latency defaults from 2.6.32 but changed to be not so aggressive
+Patch13104: sched-retune-scheduler-latency-defaults.patch
+
 Patch14000: make-mmap_min_addr-suck-less.patch
 
 # ----- send for upstream inclusion -----
@@ -1381,6 +1389,14 @@ ApplyPatch linux-2.6.29-xen-disable-gbpa
 # v12n
 ApplyPatch linux-2.6-virtio_blk-dont-bounce-highmem-requests.patch
 
+# sched fixes cherry-picked from 2.6.32
+ApplyPatch sched-deal-with-low-load-in-wake-affine.patch
+ApplyPatch sched-disable-NEW-FAIR-SLEEPERS-for-now.patch
+ApplyPatch sched-ensure-child-cant-gain-time-over-its-parent-after-fork.patch
+ApplyPatch sched-remove-shortcut-from-select-task-rq-fair.patch
+# latency defaults from 2.6.32 but changed to be not so aggressive
+ApplyPatch sched-retune-scheduler-latency-defaults.patch
+
 ApplyPatch make-mmap_min_addr-suck-less.patch
 
 # ----- sent for upstream inclusion -----
@@ -2006,6 +2022,9 @@ fi
 # and build.
 
 %changelog
+* Sat Sep 26 2009  Chuck Ebbert <cebbert at redhat.com>  2.6.30.8-67
+- Scheduler fixes cherry-picked from 2.6.32
+
 * Sat Sep 26 2009  Chuck Ebbert <cebbert at redhat.com>  2.6.30.8-66
 - Backport "appletalk: Fix skb leak when ipddp interface is not loaded"
   (fixes CVE-2009-2903)



