
Re: writing processes are blocking in log_wait_common with data=ordered

> Andrew Morton wrote:
> > 
> > Does this patch help?
> It won't, I suspect.  You've done an O_SYNC write.  ext3
> needs to write your data out to disk before returning
> from the pwrite() call.  We do that by running a commit
> and waiting for it to complete.
> In ordered mode, commit will writeback and wait upon
> your newly-dirtied data.  That's what you asked it to do.
> Other filesystems will do it by directly writing the data
> and waiting on it.  We've lost some concurrency because
> the journal is busy, but in practice I suspect it won't
> make much difference.
> Are you sure that you actually have a problem?  Does your
> application run significantly more quickly on ext2?

I think so.  Here's what I've tested so far using a test program
(attached, see P.S. below) that simulates the load.  I have:

1) Red Hat 7.2, kernel 2.4.17-rc2-aa2, with ext3 on an ATA133 disk.
This reports about 70 blks/sec.

2) Red Hat 6.2, kernel 2.4.17-rc2-aa2, with ext2 on a SCSI U160 disk.  This
reports about 420 blks/sec.

3) Red Hat 7.2 (identical hardware to #2), kernel 2.4.19-pre7-aa2, with ext3.
This reports about 40 blks/sec.

Both ext3 systems are in the 40-70 blks/sec range, even though they differ in
kernel version and hardware.  The ext2 system is roughly 10x faster, whether
compared against ext3 on the same kernel (#1) or on the same hardware (#3).

Also, kjournald has been eating a ton of CPU time lately.  It had accumulated
7 minutes of CPU time over a month, and then used another 3 minutes in a
single day since I noticed this was happening.  This is with the real
application, not the test proggy.

> (I now need to know your exact kernel version - there
> have been various goofups on the sync paths which were
> fixed relatively recently).
> I suspect that ext3 is doing an unnecessary commit
> on the fsync() case, and in the O_SYNC case, for your
> application.  If the mtime fix is in place then we
> can try to drop all the ordered-mode data buffers
> from the transaction (which will succeed) and then
> look to see if there's anything to be committed
> (there will not be).  hmm.

I will try out both your patch, which you think won't work, and various
combinations of ext3 (ordered and writeback) and ext2.  My target kernel
version is 2.4.19-pre7-aa2.  I'll try out vanilla pre7 if I have time too.
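
For reference, the I/O pattern under discussion is just a synchronous
pwrite() to a preallocated file.  Stripped down from the attached test
program, one write looks roughly like this (error handling mostly omitted,
just a sketch):

/* minimal sketch of one synchronous write, boiled down from blktest.c below;
 * assumes blktest.tmp has already been created with dd */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSIZE  8192
#define FILESIZE (512*1024*1024)

int main(void)
{
    char buff[BLKSIZE];
    off_t off;
    int fd;

    /* O_SYNC: pwrite() must not return until the data is on disk */
    if ((fd = open("blktest.tmp", O_RDWR|O_SYNC)) < 0)
        return 1;

    memset(buff, 0, BLKSIZE);

    /* pick a random 8KB-aligned offset inside the preallocated file */
    off = (off_t)(rand() % (FILESIZE/BLKSIZE)) * BLKSIZE;

    /* with data=ordered this is where the journal commit (and the wait on
     * the newly-dirtied data) happens */
    if (pwrite(fd, buff, BLKSIZE, off) != BLKSIZE)
        return 1;

    close(fd);
    return 0;
}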

One interesting and unexpected result is that running the test inside a 1GB
loopback filesystem (ext3 in a file, mounted via loop) increases performance
4-fold over running on the real filesystem!  That is, ext3-ordered looped on
top of ext3-ordered is much faster than plain ext3-ordered.  This is on kernel
2.4.17-rc2-aa2, which is a bit old, so it could be meaningless....


P.S.  I created a benchmark that demonstrates this phenomenon, called
blktest.c.  It's a bit rough (you need to recompile to change the block size,
etc.), and it's attached.  It takes a single argument: the number of
concurrent writers.  Each writer writes an 8KB block to a random location in
the file using pwrite.  The code is stupid in many places.  Excuse it.
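
In case it helps, here's roughly how I run it: create the 512MB test file
first with something like 'dd if=/dev/zero of=blktest.tmp bs=1k count=524288'
(the program prints this hint if the file is missing), then start e.g.
'./blktest 8' for eight concurrent writers.  When you stop it with SIGINT or
SIGTERM it prints the per-writer scores and the overall blks/sec figure.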

| David Mansfield              |
| david cobite com             |
/* for pwrite */
#define _XOPEN_SOURCE 500

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sys/mman.h>

#define BLKSIZE 8192
#define FILESIZE (512*1024*1024) 

void die(const char * reason)
{
    fprintf(stderr, "dying: %s\n", reason);
    exit(1);
}

void sig(int which)
{
    printf("received signal %d\n", which);
}

void do_child(int fd, int child, int * score)
{
    char buff[BLKSIZE];
    struct timeval tv;

    /* set random seed in each process */
    gettimeofday(&tv, NULL);
    srand(tv.tv_usec ^ (child << 16));

    memset(buff, child, BLKSIZE);

    while (1) {
        int block = rand() % (FILESIZE/BLKSIZE);
        if (pwrite(fd, buff, BLKSIZE, (off_t)block * BLKSIZE) != BLKSIZE)
            die("pwrite");
        /* bump this writer's slot in the shared scoreboard */
        (*score)++;
    }
}

int main(int argc, char * argv[])
{
    int i, nr_procs, fd;
    pid_t * pid;
    struct sigaction sa;
    int score_fd;
    int * score;
    struct timeval start_tv, end_tv;
    int total_score = 0;
    double secs;

    if (argc < 2)
        die("usage: blktest <number-of-concurrent-writers>");

    if ((nr_procs = atoi(argv[1])) <= 0)
        die("the number of writers must be a positive integer");

    /* the test file needs to be created beforehand */
    if ((fd = open("blktest.tmp", O_RDWR|O_SYNC)) < 0)
        die("please create a test file using:\n\ndd if=/dev/zero of=blktest.tmp bs=1k count=xxx");

    /* shared memory to keep the 'scoreboard' */
    if ((score_fd = open("/dev/zero", O_RDWR)) < 0)
        die("open /dev/zero");

    if ((score = (int*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, score_fd, 0)) == MAP_FAILED)
        die("mmap scoreboard");

    if (!(pid = (pid_t*)calloc(nr_procs, sizeof(pid_t))))
        die("calloc");

    printf("forking writers.\n");

    gettimeofday(&start_tv, NULL);

    for (i = 0; i < nr_procs; i++) {
        if ((pid[i] = fork()) < 0) {
            int j;
            for (j = 0; j < i; j++)
                kill(pid[j], SIGKILL);
            goto cleanup;
        } else if (pid[i] == 0) {
            do_child(fd, i, score + i);    /* never returns */
        }
        printf("forked process %d\n", pid[i]);
    }

cleanup:
    /* catch SIGINT/SIGTERM so the parent survives long enough to report */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sig;
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    printf("children started, waiting for signal\n");

    /* the signal kills the children (default action); reap them all */
    while (i) {
        pid_t dead = wait(NULL);
        if (dead < 0)
            continue;    /* wait() was interrupted by the signal handler */
        printf("pid %d has exited\n", dead);
        i--;
    }
    gettimeofday(&end_tv, NULL);

    for (i = 0; i < nr_procs; i++) {
        printf("score for %d: %d\n", i, score[i]);
        total_score += score[i];
    }

    end_tv.tv_sec -= start_tv.tv_sec;
    end_tv.tv_usec -= start_tv.tv_usec;
    if (end_tv.tv_usec < 0) {
        end_tv.tv_sec--;
        end_tv.tv_usec += 1000000;
    }

    secs = (double)end_tv.tv_sec + (double)end_tv.tv_usec / 1000000.0;
    printf("total score: %d blocks in %.2f seconds %f blks/sec\n", total_score, secs, (double)total_score/secs);

    return 0;
}

