head node has an extremely high load average.

Thu Jun 27 14:19:11 UTC 2013

Doll, Margaret Ann wrote:
> On Thu, Jun 27, 2013 at 8:54 AM, Miner, Jonathan W (US SSA) <
> jonathan.w.miner at baesystems.com> wrote:
>>
>> > I installed the iozone program and ran ./iozone -a.
>>
>> iozone allows you to benchmark disk performance and gives you objective
>> measurements.
>>
>> > How does this information help me find the offending program?
>>
>> Not sure you're looking for a "program"... I think you know what program
>> is doing the IO on your client machines,
>
> Users are running gaussian or their own original programs on the compute
> nodes.
> How does one determine from which node the massive io requests are coming?
>
>> and we know that "nfsd" is doing the IO on the server, and we know from
>> your previous output that you have high IO wait times.  So... you should
>> be looking at which disks are involved, and why the wait times are so
>> high..
>>
>> Are you using single drives, software raid, hardware raid? What type of
>> bus?
>
> The head node has a single disk for most users' use.  A second disk is
> owned by a single research group which was not involved in the problem.
>
> Most of the compute nodes have a single disk.  There are two compute nodes
> that have a second 700 Gb drive for use with gaussian calculations.  The
> user that caused the io problem was using one of these compute nodes and
> obviously not using the scratch space on the compute node.
>
I have two truly unpleasant thoughts (and yes, we have at least one person
here running Gaussian):
   1. What *kind* of h/d are they writing to? They're not, say, WD Caviar
        Green? We find we can't use them in some servers (mostly Penguins,
        with Supermicro m/b's), because they're "desktop", not "server"
        drives. The difference is that around '09, all the manufacturers,
        following WD's lead, took out user control of TLER (time-limited
        error recovery - it's how long a head tries before giving up,
        deciding the sector is bad, and writing elsewhere: "desktop" drives
        will go on for up to 2 min or more(!), while server drives give up
        in 6 or 7 *seconds*).
   2. I can document - and I may try again to file a Bugzilla report,
        this time using our institutional account, rather than just as me -
        a problem with NFS as implemented in RHEL 6. The issue is that if
        you're reading and writing to an NFS-mounted drive that is exported
        (rw,sync), it's approximately SEVEN TIMES SLOWER than the same in
        RHEL 5.

You can prove (2) to yourself: on an RHEL 5 server, export a directory,
mount it over NFS, cd to the mounted one, and run tar -xzvf on a large file
(we've got one with many directories and files, about 28M tar.gz, and that
takes about a minute or a minute and a half); doing the same on RHEL 6 (or
CentOS 6) takes 6.5 to 7 MINUTES.

The same file, local, takes a second or two.
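
If you want numbers rather than a stopwatch, the comparison above can be
sketched in a few lines of Python. This is only a rough illustration: the
function name time_extract and the paths in the usage comment are made up
for the example, and it measures client-side wall-clock time only.

```python
import tarfile
import time
from pathlib import Path


def time_extract(archive, dest):
    """Extract a tar archive into dest and return elapsed wall-clock seconds."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Newer Python versions want an explicit extraction filter; older ones
    # do not accept the argument, so only pass it when it is available.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    start = time.perf_counter()
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(dest, **kwargs)
    return time.perf_counter() - start


# Hypothetical usage: same archive, one NFS-mounted target and one local.
# print(time_extract("big.tar.gz", "/mnt/nfs/extract-test"))
# print(time_extract("big.tar.gz", "/tmp/extract-test"))
```

Run it against the same archive on the RHEL 5 mount, the RHEL 6 mount, and
a local directory, and compare the three figures.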

Note that exporting (rw,async) improves it a lot... but when you're
running serious scientific computing, and the job may run days, or a week
or more, you've got to be concerned.....
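
To see at a glance which exports are sync and which are async, something
like the following could be run over the contents of /etc/exports on the
server. It is a simplified sketch, not a full parser: the function name
export_sync_modes is invented here, it ignores quoted paths, line
continuations and default option groups, and it assumes an export that
names neither option behaves as sync (which I believe is the default in
current nfs-utils, but check your own man page).

```python
def export_sync_modes(exports_text):
    """Return (path, client, mode) tuples, with mode 'sync' or 'async'."""
    results = []
    for raw in exports_text.splitlines():
        # Drop comments and blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        path, clients = fields[0], fields[1:]
        for client in clients:
            # Each client looks like host(opt1,opt2) or just host.
            host, _, opts = client.partition("(")
            options = opts.rstrip(")").split(",") if opts else []
            # Assumption: no explicit async means the export is sync.
            mode = "async" if "async" in options else "sync"
            results.append((path, host, mode))
    return results


# Hypothetical usage on the NFS server:
# with open("/etc/exports") as f:
#     for path, host, mode in export_sync_modes(f.read()):
#         print(path, host, mode)
```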

        mark