Big write delays

Fri Jun 24 19:25:23 UTC 2011

        I have an Application with big write delays and need some help
determining what may be causing them. The Application uses a proprietary
MUMPS database, not a relational database manager like Oracle. Let me
first explain the architecture a little.

        There is a buffer pool in Shared Memory. When a user process
needs a record, it first searches the Shared Memory buffer pool for the
block needed. If the block is not found, the process allocates a buffer
in Shared Memory and reads the block from the disk into that buffer.
This way other processes can access the block without doing the physical
read again. If the block is modified, the user process does not do the
write; it just leaves the block in the buffer pool and another support
process (disk writer) writes it later. So the user process only has to
wait on reads, not on writes.
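
        A rough sketch of the lookup and deferred-write path described
above, in Python purely for illustration. The class and method names
(BufferPool, get, modify, flush) and the read_block/write_block
callables are hypothetical; the real Application's shared-memory
structures and locking are not shown here.

```python
class BufferPool:
    """Toy model of a shared buffer pool: read-through cache, deferred writes."""

    def __init__(self, read_block):
        self.read_block = read_block  # hypothetical callable doing the physical read
        self.buffers = {}             # block number -> bytearray holding the block
        self.dirty = set()            # blocks modified but not yet written to disk
        self.physical_reads = 0

    def get(self, block_no):
        # Search the pool first; only a miss costs a physical read.
        if block_no not in self.buffers:
            self.buffers[block_no] = bytearray(self.read_block(block_no))
            self.physical_reads += 1
        return self.buffers[block_no]

    def modify(self, block_no, offset, data):
        # The user process changes the buffer and marks it dirty; it does not write.
        buf = self.get(block_no)
        buf[offset:offset + len(data)] = data
        self.dirty.add(block_no)

    def flush(self, write_block):
        # Roughly what a separate disk writer process would do later.
        for block_no in sorted(self.dirty):
            write_block(block_no, bytes(self.buffers[block_no]))
        self.dirty.clear()
```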

        If the user process has to modify the block and it has not been
Before Imaged since a point in time, then the block is copied to another
buffer that will be written to the Before Image Log by another support
process (bil writer). There are a limited number of buffers for this
use.

        The user process will also put the transaction that caused the
block change in a journal buffer, which gets written by another support
process (jnl writer). There are a limited number of buffers for this
use.

        A little bit about the hardware (disk) layout. There is an HBA
RAID array where the OS, swap and other file systems are located. The
database data, BIL and JNL are stored directly on Logical Volumes in a
Volume Group whose Physical Volumes are LUNs on a SAN. So there is no
file system on the LVs, just direct reads and writes to the LV. Writes
are done with the standard write system call, followed by a call to
fdatasync, which makes the writer process wait until the block is truly
on the disk, or at least until the SAN has accepted the block. The
write and fdatasync normally take less than 0.0004 seconds on average.
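
        For anyone who wants to reproduce the timing, here is a minimal
sketch of measuring one write followed by fdatasync. It is in Python
only for brevity (the writer processes themselves are not Python), the
function name timed_sync_write is made up, and it assumes a Linux system
where os.fdatasync is available.

```python
import os
import time


def timed_sync_write(fd, data):
    """Write data to fd, wait for fdatasync, and return elapsed seconds."""
    start = time.perf_counter()
    written = 0
    while written < len(data):
        # os.write may write fewer bytes than requested, so loop until done.
        written += os.write(fd, data[written:])
    # Block until the data (not necessarily all metadata) reaches the device.
    os.fdatasync(fd)
    return time.perf_counter() - start
```

Logging any call where the returned value exceeds, say, one second
would show exactly when the stalls start and how long they last.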

        Now for the problem. Many times when a large file (1 to 2 GB) is
written to the local disk, the write/fdatasync calls to the SAN slow
down badly, taking from several seconds to several minutes. When this
happens, the limited number of BIL and JNL buffers fill up and the
Application user processes have to wait for them to be written before
they can complete a transaction. This makes it seem like the Application
has locked up, and basically it has, because it is waiting on a
resource.
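
        The stall can be modelled with a bounded queue standing in for
the limited BIL or JNL buffers. This is only an illustration: the names
submit and drain_one are hypothetical, and the real buffers live in
Shared Memory rather than in a Python queue.

```python
import queue


def submit(buffers, record, timeout=None):
    """User-process side: wait for a free buffer; False if none frees up in time."""
    try:
        buffers.put(record, timeout=timeout)
        return True
    except queue.Full:
        return False


def drain_one(buffers, write):
    """Writer side: take the oldest record and hand it to the (slow) write."""
    record = buffers.get()
    write(record)
```

With queue.Queue(maxsize=2), a third submit cannot succeed until
drain_one has run once, so a writer stuck in a slow fdatasync holds up
every user process that needs a buffer.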

        I don't understand how the I/O on the local disk is affecting
the I/O going to the SAN. The two use different HBAs, and the local disk
I/O is unrelated to the Application: it does not touch the Shared Memory
the Application is using, so there should not be any memory page locking
or similar contention between them. The problem can be caused just by
doing something like "cp /tmp/bigfile /var/tmp/bigfile" where /tmp and
/var are on different file systems, but both are in the same VG on the
local disk, which is a different VG than the one the database LVs are
in.

        Running vmstat 5 while this is happening shows a few blocks
being written, most likely the big file that was copied, which is mostly
sitting in the kernel buffers, aging and being flushed to disk.
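
        One possible way to check that guess is to sample the Dirty and
Writeback lines of /proc/meminfo alongside vmstat. The helper names
below (parse_meminfo, dirty_summary) are made up for this sketch, and it
assumes the usual "Name:   value kB" layout of that file.

```python
def parse_meminfo(text):
    """Return {field: value in kB} for /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name:   <integer> [kB]".
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def dirty_summary(text):
    """Pick out the amount of dirty page cache and pages under writeback."""
    info = parse_meminfo(text)
    return {key: info.get(key) for key in ("Dirty", "Writeback")}
```

Calling dirty_summary(open("/proc/meminfo").read()) every few seconds
during the cp would show whether a large amount of dirty page cache is
building up while the SAN writes are stalled.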

        Most systems have 4G or more of Physical Memory and when this
happens there is still a good bit of free memory and nothing gets paged
out. So I don't see it as an overall low memory problem. And Shared
Memory has been locked in Physical Memory. When this first showed up as
a problem I did some research and found that Shared Memory is one of the
first things to get paged out. This is the reason it is now locked in
Physical Memory, but it did not really help.

        The Application is running on RHEL 5.6 and this happens on
various hardware from Dell, HP and IBM. All use different local disk HBA
Raid controllers. I can provide more details about the exact kernel
version and other things if needed.

        I know this has been long, but I hope you all will take the time
to read all this and be able to make some good suggestions as to what
may be causing this problem.

-----

Jack Allen