Performance tuning the Fedora Desktop

Mon May 10 16:12:25 UTC 2004

Soeren Sandmann Pedersen wrote:
> Hi Will
> 
> 
>>How well or poorly did the performance tools work in identifying the
>>performance problem?
> 
> 
> I think profiling CPU usage at the desktop level has two important
> properties:
> 
>    1  A call graph is essential
>    2  The data don't have to be very accurate
> 
> Ad 1: The desktop CPU problems are generally algorithmic in nature. The
> big improvements come from fixing O(n^2) algorithms and from adding
> caching and other high-level optimizations. To do this it is essential
> to know *why* something time-consuming is being done, so that you can in
> the best case change the algorithm to not actually do it anymore.

The algorithms selected have a huge impact on performance. However, it
is not always clear that the wrong algorithm was selected until the code
is used. Data structures have different strengths, e.g. it is cheap to
index and fetch from an array, but expensive to insert elements at the
beginning of an array.
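
To make the array example concrete, here is a rough sketch that counts
how many element moves it takes to build an array by always inserting at
the front. The function names are made up for illustration, and the cost
model (one move per existing element per insert) is an assumption about
a plain contiguous array, not a measurement of any real toolkit code.

```python
def front_insert_shifts(n):
    """Count element moves needed to insert n items at the front of an array."""
    shifts = 0
    length = 0
    for _ in range(n):
        # Every element already stored has to move one slot to the right.
        shifts += length
        length += 1
    return shifts


def index_fetches(n):
    """Fetching by index touches one element per fetch, whatever the size."""
    return n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, front_insert_shifts(n), index_fetches(n))
```

The move count grows as n*(n-1)/2, which is the kind of O(n^2) behavior
that only shows up once the code is used with a large n.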

> Ad 2: Since you are working on high-level optimizations, you need to
> know stuff like "30% in metacity" and get a rough break-down of those
> 30%. The profiler must not be so intrusive that the applications become 
> unusable, but slightly skewed data is not a disaster.

Yes, low overhead is more important than absolute accuracy. I think for
right now the tuning is looking for the "low hanging fruit". Whether the
profiler says that something takes 30% or 33% is not going to make a big
difference. For the most part we just want to point out the major
resource hogs. It would be painful for users of the GUI on the desktop
to be slowed by emulation, plus users might do things differently if the
speed is too different.
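
As a back-of-the-envelope check on how much accuracy a sampling profiler
gives, the sketch below estimates the statistical error of a measured
fraction. It assumes the samples are independent (a simplification; real
samples are correlated in time), and the 50 samples per second and 60
second run are just example numbers.

```python
import math


def samples_collected(rate_hz, seconds):
    """Number of samples taken at a fixed rate over a run."""
    return int(rate_hz * seconds)


def sampling_std_error(fraction, samples):
    """Standard error of an observed fraction, assuming independent samples."""
    return math.sqrt(fraction * (1.0 - fraction) / samples)


if __name__ == "__main__":
    n = samples_collected(50, 60)       # example: 50 Hz for one minute
    err = sampling_std_error(0.30, n)   # example: a component seen at 30%
    print(n, round(err * 100, 2), "percentage points")
```

Under those assumptions a one minute run already pins a 30% reading down
to about one percentage point, which is plenty for finding the big
resource hogs.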

> This high-level optimization is in contrast to tuning of inner loops,
> where the properties are reversed:
> 
>    1  In which function do we spend the time
>    2  What, exactly, is the CPU doing. You want to know about 
>       cache misses and divisions and branch predictions and such
>       things. You want to know in what lines of source code the time
>       is spent.
> 
> In this case you generally don't try to stop doing it, you try to do it
> faster.

OProfile can certainly provide information on cache misses, branch 
predictions, and other performance monitoring events.

> The sysprof profiler, which can be checked out of GNOME cvs, is clearly
> aiming at the first kind of profiling.
> 
> Sysprof works with a kernel module that 50 times per second generates a
> stacktrace of the process in the "current" variable, unless the pid of
> that process is 0. A userspace application then reads those stacktraces
> and presents the information graphically in lists and trees.

The oprofile support in Fedora Core 2 test3 has a similar mechanism to
walk the stack, but it typically uses the performance monitoring
hardware to trigger the sampling. It only works for x86 (other 
processors do not include frame pointers). You might want to take a look 
at it. It won't work for hugemem kernels because there are separate 
address spaces for user and kernel mode, but I imagine for most desktop 
work people are not using hugemem kernels.

On Pentium4 and Pentium M there are performance monitoring events that 
count calls, so the sampling can be done based on the number of calls. 
This may be more desirable than time-based sampling.

However, one drawback of this statistical call graph information is
that one ends up with a call graph forest rather than a call graph tree.
The sampling will miss the lone call that causes a lot of work unless
the code happens to walk far enough up the stack. Does the sysprof stack
tracer you use walk the entire user stack each time it takes a sample?
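
To illustrate the forest problem, here is a small sketch that builds
caller/callee edges from stack samples. The function build_call_graph,
the max_depth parameter, and the sample stacks are all hypothetical; this
is not how oprofile or sysprof store their data, just a model of what
happens when the walker records only the innermost few frames.

```python
from collections import defaultdict


def build_call_graph(samples, max_depth=None):
    """Build (roots, edges) from stack samples.

    Each sample lists frames from outermost caller to innermost callee.
    max_depth is how many innermost frames the walker records; None means
    the whole stack is walked.
    """
    if max_depth is not None and max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    edges = defaultdict(int)
    roots = set()
    for stack in samples:
        frames = stack if max_depth is None else stack[-max_depth:]
        if not frames:
            continue
        roots.add(frames[0])  # outermost frame the walker actually saw
        for caller, callee in zip(frames, frames[1:]):
            edges[(caller, callee)] += 1
    return roots, dict(edges)


if __name__ == "__main__":
    samples = [
        ["main", "event_loop", "redraw", "layout"],
        ["main", "event_loop", "redraw", "paint"],
        ["main", "load_config", "parse"],
    ]
    print(build_call_graph(samples))               # one root: a tree
    print(build_call_graph(samples, max_depth=2))  # several roots: a forest
```

With the full stack every sample hangs off "main". With only two frames
per sample the graph splits into separate trees rooted at "redraw" and
"load_config", and the calls that led there are lost.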

> So it is a statistical, sampling profiler. The kernel code probably
> reveals that I am not an experienced kernel hacker. Generally I worked
> from various driver writing guides I found on the net, and I consider it
> quite likely to break on more exotic kernels, where "exotic" means
> different from mine.
> 
> Its killer feature I think is the presentation of the data. For each
> function you can get a complete break-down of the children in which that
> function spends its time. This even works with recursion, including
> mutual recursion. Generally it never reports a function as calling
> itself, instead it combines the numbers correctly. The not completely
> trivial details would make this mail much longer.
> 
> That you can change the view of the data quickly makes it possible to
> get a good high-level overview of the performance characteristics of the
> system.
> 
> A different property sysprof has is that it is fairly easy to get
> running. Just install a kernel module and start the application and you
> are set. I found oprofile a bit more difficult to get started with.

oprofile has been more difficult to set up in the past. However, now
one can pretty much just install an RH smp kernel, boot it, run
"opcontrol --setup --no-vmlinux; opcontrol --start", and have
profiling for user code. There is still room for improvement.

> It seems to me that since oprofile probably reports more and better data
> than my kernel module, we should try and get the graphical presentation
> from sysprof to present oprofile data. It shouldn't be too difficult to
> do this; the presentation code was lifted from the memprof/speedprof
> profiler and is quite independent of the rest of the profiler. (Actually
> you could argue that the presentation code pretty much _is_ the entire
> profiler).

I will take a look at sysprof to see how it presents data.
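
On the recursion point, one common way to keep a recursive function from
being counted several times per sample is to count each function at most
once per stack. The sketch below shows that idea only; it is my guess at
the general approach, not sysprof's actual code, and the function name
and sample stacks are invented.

```python
from collections import Counter


def summarize(samples):
    """Return (self_hits, total_hits) per function from stack samples.

    Each sample lists frames from outermost caller to innermost callee.
    """
    self_hits = Counter()
    total_hits = Counter()
    for stack in samples:
        if not stack:
            continue
        # The innermost frame is where the CPU actually was.
        self_hits[stack[-1]] += 1
        # set() collapses repeated frames, so recursion counts only once.
        for func in set(stack):
            total_hits[func] += 1
    return self_hits, total_hits


if __name__ == "__main__":
    samples = [
        ["main", "walk", "walk", "walk"],  # direct recursion
        ["main", "walk", "visit"],
        ["main", "a", "b", "a"],           # mutual recursion
    ]
    self_hits, total_hits = summarize(samples)
    for func, hits in total_hits.most_common():
        print(func, hits, self_hits[func])
```

Counting this way, no function's total can exceed the number of samples,
so the percentages stay sensible even for deeply recursive code.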

> Another thing that might be nice is a library that would allow symbol
> lookup in binaries. I spent quite a bit of time whacking the memprof
> code to deal with prelinked binaries, and I am not too confident I got
> it completely right.
> 
> 
> Soeren

Thanks for the comments.

-Will