From leandro at dutra.fastmail.fm  Thu Apr  1 00:06:36 2004
From: leandro at dutra.fastmail.fm (=?iso-8859-1?q?Leandro_Guimar=E3es_Faria_Corsetti_Dutra?=)
Date: Wed, 31 Mar 2004 21:06:36 -0300
Subject: PROBLEM: log abort over RAID5
References: <pan.2004.03.05.13.45.37.186626@dutra.fastmail.fm>
	<1078873638.2460.81.camel@sisko.scot.redhat.com>
	<pan.2004.03.10.00.24.57.685979@dutra.fastmail.fm>
	<1080731519.1991.3.camel@sisko.scot.redhat.com>
Message-ID: <pan.2004.04.01.00.06.36.715070@dutra.fastmail.fm>

On Wed, 31 Mar 2004 12:12:01 +0100, Stephen C. Tweedie wrote:

> A _full_ checklist would include every piece of hardware in your machine,
> and every module you've got compiled or loaded into the kernel, plus a ton
> of privileged applications such as X.

	So you mean it is basically impossible, and I have to wait
until more people get ext3-on-RAID5 journal aborts, including a few
hackers?

	OK, so I've just gone back to 2.4 and will stay there for the
foreseeable future.


> I've been seeing some reports on raid5, yes.  Current kernels look OK in
> the main for most people, though there are still the occasional problems
> being discovered: such is 2.6.  Nothing springs to mind that particularly
> matches your own symptoms, though.

http://www.ussg.iu.edu/hypermail/linux/kernel/0402.3/1517.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.2/0450.html
http://marc.theaimsgroup.com/?l=linux-raid&m=107351745306237&w=2

	This was just from the linux-kernel and linux-raid archives.
When I did Google around, I found some other similar ones...


-- 
Leandro Guimarães Faria Corsetti Dutra <leandro at dutra.fastmail.fm>
Maringá, PR, BRASIL
http://br.geocities.com./lgcdutra/
Soli Deo Gloria!





From eagle112113 at yahoo.com  Thu Apr  1 09:40:49 2004
From: eagle112113 at yahoo.com (Scorpion Yang)
Date: Thu, 1 Apr 2004 01:40:49 -0800 (PST)
Subject: ext3_free_data()
Message-ID: <20040401094049.37281.qmail@web60907.mail.yahoo.com>

Hi,
Why is there a window between the statements "err = ext3_journal_get_write_access(handle, this_bh);" and "ext3_journal_dirty_metadata(handle, this_bh);"?

In my test, there is an error in journal_dirty_metadata: jh is NULL.

thanks!
D.Yang
April 1, 2004
----------------------------------------------------------------------------------------------------------------------------------
/**
 * ext3_free_data - free a list of data blocks
 * @handle: handle for this transaction
 * @inode: inode we are dealing with
 * @this_bh: indirect buffer_head which contains *@first and *@last
 * @first: array of block numbers
 * @last: points immediately past the end of array
 *
 * We are freeing all blocks referred from that array (numbers are stored as
 * little-endian 32-bit) and updating @inode->i_blocks appropriately.
 *
 * We accumulate contiguous runs of blocks to free.  Conveniently, if these
 * blocks are contiguous then releasing them at one time will only affect one
 * or two bitmap blocks (+ group descriptor(s) and superblock) and we won't
 * actually use a lot of journal space.
 *
 * @this_bh will be %NULL if @first and @last point into the inode's direct
 * block pointers.
 */
static void ext3_free_data(handle_t *handle, struct inode *inode,
			   struct buffer_head *this_bh, u32 *first, u32 *last)
{
	unsigned long block_to_free = 0; /* Starting block # of a run */
	unsigned long count = 0;	 /* Number of blocks in the run */
	u32 *block_to_free_p = NULL;	 /* Pointer into inode/ind
					    corresponding to block_to_free */
	unsigned long nr;		 /* Current block # */
	u32 *p;				 /* Pointer into inode/ind
					    for current block */
	int err;

	if (this_bh) {			 /* For indirect block */
		BUFFER_TRACE(this_bh, "get_write_access");
		err = ext3_journal_get_write_access(handle, this_bh);
		/* Important: if we can't update the indirect pointers
		 * to the blocks, we can't free them. */
		if (err)
			return;
	}

	for (p = first; p < last; p++) {
		conditional_schedule();
		nr = le32_to_cpu(*p);
		if (nr) {
			/* accumulate blocks to free if they're contiguous */
			if (count == 0) {
				block_to_free = nr;
				block_to_free_p = p;
				count = 1;
			} else if (nr == block_to_free + count) {
				count++;
			} else {
				ext3_clear_blocks(handle, inode, this_bh,
						  block_to_free,
						  count, block_to_free_p, p);
				block_to_free = nr;
				block_to_free_p = p;
				count = 1;
			}
		}
	}

	if (count > 0)
		ext3_clear_blocks(handle, inode, this_bh, block_to_free,
				  count, block_to_free_p, p);

	if (this_bh) {
		BUFFER_TRACE(this_bh, "call ext3_journal_dirty_metadata");
		ext3_journal_dirty_metadata(handle, this_bh);
	}
}



From guolin at alexa.com  Fri Apr  2 06:36:41 2004
From: guolin at alexa.com (Guolin Cheng)
Date: Thu, 1 Apr 2004 22:36:41 -0800
Subject: Strange Fedora Booting problem: can not mount "LABEL=*" partitions
Message-ID: <41089CB27BD8D24E8385C8003EDAF7AB084841@karl.alexa.com>

Hi, 

   I have a boot problem with Fedora Core 1 (FC1) running a vanilla 2.4.25 kernel plus the libata8 patch. At boot, FC1 complains that it cannot find the partitions specified with "LABEL=" in /etc/fstab, and then drops me into repair mode. In repair mode I can mount them manually without any problems. More interesting: 1) I have several partitions specified with "LABEL=*" in /etc/fstab, but FC1 keeps failing to identify the same partition, even on different machines; 2) the default and upgraded NTPL kernels boot without problems.  My fstab is attached below:

LABEL=/                 /                       ext3    defaults        1 1
LABEL=/0                /0                      ext3    defaults        1 2
/dev/hdc1               /1                      ext3    defaults        1 2
LABEL=/alexa            /alexa                  ext3    defaults        1 2
none                    /dev/pts                devpts  gid=5,mode=620  0 0
none                    /proc                   proc    defaults        0 0
none                    /dev/shm                tmpfs   defaults        0 0
LABEL=/usr              /usr                    ext3    defaults        1 2
LABEL=/var              /var                    ext3    defaults        1 2
/dev/hda7               swap                    swap    defaults        0 0
/dev/hda6               swap                    swap    defaults        0 0
/dev/hda8               swap                    swap    defaults        0 0
/dev/fd0                /mnt/floppy             auto    noauto,owner,kudzu 0 0
ops-test1.alexa.com guolin 134%

 FC1 stops on partition "LABEL=/var" on two machines and on partition "LABEL=/" on the third machine. While the default or upgraded NTPL kernel (with the SMP problem) boots without a glitch, my vanilla 2.4.25 kernel plus the libata patch 2.4.25-libata8 fails with the symptoms described above.

 The workaround is to manually run "e2fsck -y -f /dev/hd?; tune2fs -j /dev/hd?; e2label /dev/hd? <LABEL>" again, even though there is no problem with the file system, the journal, or the ext2 label, and then reboot.
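
 The workaround above could be scripted along these lines. This is only a sketch: it prints the three commands quoted in the message instead of running them, and the function name and the example device are hypothetical, not taken from the original setup.

```shell
# Print (do not run) the three repair commands for one device/label pair.
# The commands are the ones quoted in the message; relabel_cmds and the
# example device below are illustrative assumptions.
relabel_cmds() {
    dev="$1"      # block device, e.g. /dev/hda5 (hypothetical)
    label="$2"    # ext2/ext3 label to set, e.g. /var
    echo "e2fsck -y -f $dev"
    echo "tune2fs -j $dev"
    echo "e2label $dev $label"
}

# Example: relabel_cmds /dev/hda5 /var
```

 Printing first makes it possible to review the command list before piping it to a shell on each machine.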
  
  Since we have several hundred RH8 machines to upgrade to Fedora, we cannot afford to fix the boot problem one machine at a time. So where is the problem? The file system utilities? The 2.4.25 kernel? Or the libata patch?

  The machines have Fedora Core 1 with all packages upgraded: util-linux-2.11y-29, e2fsprogs-1.34-1, 2.4.25+2.4.25-libata8.

  The system disk's partitions were originally created under Red Hat 8.0. The upgrade to FC1 is simple: boot the machine into an FC1 diskless mode, create file systems on the existing /, /usr, and /var partitions on the system disk, label the three partitions, unpack the system tarballs onto them, install the lilo boot loader onto the system disk, and reboot. This simple and efficient method has worked well for us for years, except this time. :(

  Any suggestions? And what is the difference between the 2.4.25-libata8 and 2.4.25-libata16 (bleeding-edge) patches?


  Thanks a lot.

  --Guolin Cheng
  





From guolin at alexa.com  Fri Apr  2 06:53:07 2004
From: guolin at alexa.com (Guolin Cheng)
Date: Thu, 1 Apr 2004 22:53:07 -0800
Subject: Strange Fedora Booting problem: can not mount "LABEL=*"
	partitions
Message-ID: <41089CB27BD8D24E8385C8003EDAF7ABBA487B@karl.alexa.com>

Hi,

 Sorry, that should be NPTL, not NTPL. A typo; too embarrassing. :(

 --Guolin 





From crosser at rol.ru  Mon Apr  5 15:06:27 2004
From: crosser at rol.ru (Eugene Crosser)
Date: Mon, 05 Apr 2004 19:06:27 +0400
Subject: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1080740974.1991.28.camel@sisko.scot.redhat.com>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
Message-ID: <1081177587.7677.110.camel@ariel.sovam.com>

On Wed, 2004-03-31 at 17:49, Stephen C. Tweedie wrote:

> > I'd be happy to provide more information but so far I cannot decide
> > where to look...  Should I learn to use "kernel profiling"?
> 
> Sound like it.  You've got two choices --- the simple "readprofile"
> (boot with profile=2), or set up an oprofile kernel.  For complex
> user/kernel interactions oprofile can be really helpful, but for
> something that's simply stuck in the kernel, readprofile is fine.

OK, this is readprofile output of sync(1).  To reproduce the situation,
I did a lot of copying of data and, in parallel, ran setquota for a few
thousand group IDs.  After this kind of activity, sync becomes slow
(in my case it took a couple of minutes; it will take much longer after
more activity).

$ readprofile -m /isolinux/System.map |sort -n|tail -20
    11 zap_pte_range                              0.0220
    13 system_call                                0.2321
    14 do_wp_page                                 0.0182
    19 __find_get_page                            0.2375
    35 __constant_memcpy                          0.1287
    35 ext3_group_sparse                          0.1683
    47 .text.lock.tty_io                          0.1196
    70 dqget                                      0.1326
    76 .text.lock.inode                           0.3028
    89 do_page_fault                              0.0668
    97 .text.lock.namei                           0.0820
   105 .text.lock.read_write                      0.9052
   138 .text.lock.attr                            2.1562
   202 .text.lock.inode                           0.3033
   388 .text.lock.ioctl                          10.7778
   445 .text.lock.exit                            1.5188
  1283 default_idle                              16.0375
  2942 .text.lock.sched                           7.8245
  4414 vfs_quota_sync                            11.0350
 10775 total                                      0.0060

Does it help?  Tell me what to do next.

Eugene

From crosser at rol.ru  Tue Apr  6 12:50:26 2004
From: crosser at rol.ru (Eugene Crosser)
Date: Tue, 06 Apr 2004 16:50:26 +0400
Subject: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1081177587.7677.110.camel@ariel.sovam.com>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
Message-ID: <1081255826.22308.57.camel@ariel.sovam.com>

More representative statistics for my "quota on ext3" trouble:

After moving about 10,000 files and setting quota for a million
group IDs, and then several hours of inactivity(!), I zeroed the profile
counters (readprofile -r), ran `time sync' and then `readprofile'.  Here
are the results.  Yes, that's true, it took 3 (three) hours for `sync'
to complete!

root@nfsa2:/root# time sync
real    179m43.273s
user    0m0.000s
sys     177m51.640s
 
root@nfsa2:~# readprofile -m /isolinux/System.map |sort -n|tail -20
    84 system_call                                1.5000
    96 .text.lock.ioctl                           2.6667
   100 .text.lock.namei                           0.0845
   112 tg3_poll                                   0.3684
   116 csum_partial_copy_generic                  0.4603
   123 serial_out                                 1.9219
   136 scsi_dispatch_cmd                          0.2024
   155 megaraid_isr_memmapped                     1.6146
   377 serial_in                                  7.8542
   436 .text.lock.module                          1.6769
  1209 do_softirq                                 5.3973
  1705 .text.lock.inode                           2.5601
  2199 .text.lock.exit                            7.5051
 18622 .text.lock.tty_io                         47.3842
 70802 .text.lock.buffer                        111.6751
139032 .text.lock.read_write                    1198.5517
501133 .text.lock.sched                         1332.8005
513897 default_idle                             6423.7125
1065574 vfs_quota_sync                           2663.9350
2318782 total                                      1.2959
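
To make listings like the one above easier to compare, each symbol's tick count can be expressed as a share of the "total" line. What follows is a minimal sketch, assuming the three-column readprofile layout shown above (ticks, symbol, load); the function name profile_share is hypothetical. By that arithmetic, vfs_quota_sync accounts for roughly 46% of all ticks in the second run.

```shell
# Read readprofile output (ticks, symbol, load) on stdin and print each
# symbol's share of the "total" line as a percentage, largest first.
# profile_share is an illustrative name, not part of readprofile.
profile_share() {
    awk '
        $2 == "total" { total = $1; next }      # remember the grand total
        NF >= 2       { ticks[$2] += $1 }       # sum ticks for repeated symbols
        END {
            if (total == 0) exit 1              # no total line: nothing to report
            for (s in ticks)
                printf "%.1f %s\n", 100 * ticks[s] / total, s
        }
    ' | sort -rn
}

# Example: readprofile -m /isolinux/System.map | profile_share | head
```

Summing per symbol matters because a name such as .text.lock.inode can appear on more than one line of the output.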


From mvolaski at aecom.yu.edu  Fri Apr  9 07:58:03 2004
From: mvolaski at aecom.yu.edu (Maurice Volaski)
Date: Fri, 9 Apr 2004 03:58:03 -0400
Subject: [Q] Where and what is t-cs.gmo?
In-Reply-To: <20040406160004.24079742BA@hormel.redhat.com>
References: <20040406160004.24079742BA@hormel.redhat.com>
Message-ID: <a06100501bc9bf9ba1522@[129.98.90.227]>

Compiling e2fsprogs-1.35 under Linux complains....

make[2]: Entering directory `/src/kernel/e2fsprogs-1.35/po'
: --update cs.po e2fsprogs.pot
rm -f cs.gmo && : -c --statistics -o cs.gmo cs.po
mv: cannot stat `t-cs.gmo': No such file or directory
make[2]: *** [cs.gmo] Error 1
make[2]: Leaving directory `/src/kernel/e2fsprogs-1.35/po'
make[1]: *** [all-progs-recursive] Error 1
make[1]: Leaving directory `/src/kernel/e2fsprogs-1.35'
make: *** [all] Error 2
-- 

Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




From tytso at mit.edu  Sun Apr 11 19:47:52 2004
From: tytso at mit.edu (Theodore Ts'o)
Date: Sun, 11 Apr 2004 15:47:52 -0400
Subject: [Q] Where and what is t-cs.gmo?
In-Reply-To: <a06100501bc9bf9ba1522@[129.98.90.227]>
References: <20040406160004.24079742BA@hormel.redhat.com>
	<a06100501bc9bf9ba1522@[129.98.90.227]>
Message-ID: <20040411194752.GA8479@thunk.org>

On Fri, Apr 09, 2004 at 03:58:03AM -0400, Maurice Volaski wrote:
> Compiling e2fsprogs-1.35 under Linux complains....
> 
> make[2]: Entering directory `/src/kernel/e2fsprogs-1.35/po'
> : --update cs.po e2fsprogs.pot
> rm -f cs.gmo && : -c --statistics -o cs.gmo cs.po
> mv: cannot stat `t-cs.gmo': No such file or directory
> make[2]: *** [cs.gmo] Error 1
> make[2]: Leaving directory `/src/kernel/e2fsprogs-1.35/po'
> make[1]: *** [all-progs-recursive] Error 1
> make[1]: Leaving directory `/src/kernel/e2fsprogs-1.35'
> make: *** [all] Error 2

Try using "configure --with-included-gettext".  That should solve the
problem; it's caused by your system having too old a copy of the
gettext library, and automatic autoconf macros supplied by gettext
aren't particularly good at handling backwards compatibility with
older gettext implementations.

Or you could just ignore the errors; the errors occurred trying to
build the translation data files, which happens at the end of the
build process, and everything else important has been built.  So if
you don't care about seeing e2fsck messages being output in Polish or
Turkish, you can just ignore the errors.  :-)

					- Ted




From mvolaski at aecom.yu.edu  Sun Apr 11 20:20:39 2004
From: mvolaski at aecom.yu.edu (Maurice Volaski)
Date: Sun, 11 Apr 2004 16:20:39 -0400
Subject: [Q] Where and what is t-cs.gmo?
In-Reply-To: <20040411194752.GA8479@thunk.org>
References: <20040406160004.24079742BA@hormel.redhat.com>
	<a06100501bc9bf9ba1522@[129.98.90.227]>
	<20040411194752.GA8479@thunk.org>
Message-ID: <a06100502bc9f51a3609e@[129.98.90.227]>

>On Fri, Apr 09, 2004 at 03:58:03AM -0400, Maurice Volaski wrote:
>>  Compiling e2fsprogs-1.35 under Linux complains....
>>
>>  make[2]: Entering directory `/src/kernel/e2fsprogs-1.35/po'
>>  : --update cs.po e2fsprogs.pot
>>  rm -f cs.gmo && : -c --statistics -o cs.gmo cs.po
>>  mv: cannot stat `t-cs.gmo': No such file or directory
>>  make[2]: *** [cs.gmo] Error 1
>>  make[2]: Leaving directory `/src/kernel/e2fsprogs-1.35/po'
>>  make[1]: *** [all-progs-recursive] Error 1
>>  make[1]: Leaving directory `/src/kernel/e2fsprogs-1.35'
>>  make: *** [all] Error 2
>
>Try using "configure --with-included-gettext".  That should solve the
>problem; it's caused by your system having too old a copy of the
>gettext library, and automatic autoconf macros supplied by gettext
>aren't particularly good at handling backwards compatibility with
>older gettext implementations.

I didn't seem to have gettext at all, so I installed version 0.14.1. 
And ./configure said:

checking for xgettext... (cached) no

despite
whereis xgettext
xgettext: /usr/local/bin/xgettext

Then I found an option to configure using the included gettext as you 
mention above. That initially gave

/bin/chmod +x mk_cmds
../et/compile_et --build-tree ./ss_err.et
../et/compile_et: /usr/bin/awk: No such file or directory
../et/compile_et: /usr/bin/awk: No such file or directory
make[2]: *** [ss_err.c] Error 127

which I now realize occurs because it is hard-coded to look for awk 
in /usr/bin.

I just tried again after symlinking awk in /usr/bin and got:

: multiple definition of `_nl_find_msg'
../intl/libintl.a(dcigettext.o)(.text+0x6d8):/src/kernel/e2fsprogs-1.35/intl/dcigettext.c:698: 
first defined here
/usr/bin/ld: Warning: size of symbol `_nl_find_msg' changed from 1325 
in ../intl/libintl.a(dcigettext.o) to 1309 in 
../intl/libintl.a(dcigettext.o)


>build process, and everything else important has been built.  So if
>you don't care about seeing e2fsck messages being output in Polish or
>Turkish, you can just ignore the errors.  :-)

I just ended up compiling with  ./configure --disable-nls
-- 

Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




From tytso at mit.edu  Mon Apr 12 02:49:10 2004
From: tytso at mit.edu (Theodore Ts'o)
Date: Sun, 11 Apr 2004 22:49:10 -0400
Subject: [Q] Where and what is t-cs.gmo?
In-Reply-To: <a06100502bc9f51a3609e@[129.98.90.227]>
References: <20040406160004.24079742BA@hormel.redhat.com>
	<a06100501bc9bf9ba1522@[129.98.90.227]>
	<20040411194752.GA8479@thunk.org>
	<a06100502bc9f51a3609e@[129.98.90.227]>
Message-ID: <20040412024910.GA15438@thunk.org>

On Sun, Apr 11, 2004 at 04:20:39PM -0400, Maurice Volaski wrote:
> I didn't seem to have gettext at all, so I installed version 0.14.1. 
> And ./configure said:
> 
> checking for xgettext... (cached) no
> 
> despite
> whereis xgettext
> xgettext: /usr/local/bin/xgettext

configure checks for xgettext in each of the directories in your PATH.
Apparently you don't have /usr/local/bin in your PATH.

> Then I found an option to configure using the included gettext as you 
> mention above. That initially gave
> 
> /bin/chmod +x mk_cmds
> ../et/compile_et --build-tree ./ss_err.et
> ../et/compile_et: /usr/bin/awk: No such file or directory
> ../et/compile_et: /usr/bin/awk: No such file or directory
> make[2]: *** [ss_err.c] Error 127
> 
> which I now realize occurs because it is hard-coded to look for awk 
> in /usr/bin.

No, it's not hard-coded.  compile_et is created from compile_et_sh.in,
which uses awk from the location determined by configure:

#!/bin/sh
#
#
AWK=@AWK@

So if it was /usr/bin/awk, it was because the configure script thought
it was there.  I'm not sure why that would have been the case, unless
you had a config.cache file generated from another system.
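
The placeholder substitution described above can be pictured with a simplified sketch. It is not the actual code configure generates; it only mimics the effect on one placeholder, and the function name substitute_awk is hypothetical.

```shell
# Replace the @AWK@ placeholder in a template read on stdin with the
# path given as the first argument. This is a simplified model of what
# configure's output step does; the function name is an assumption.
substitute_awk() {
    sed "s|@AWK@|$1|g"
}

# Example: substitute_awk /usr/local/bin/awk < compile_et_sh.in
```

Whatever path configure has recorded for awk, including one read back from an old config.cache, ends up written into the generated script this way.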

> I just tried again after symlinking awk in /usr/bin and got:
> 
> : multiple definition of `_nl_find_msg'
> ../intl/libintl.a(dcigettext.o)(.text+0x6d8):/src/kernel/e2fsprogs-1.35/intl/dcigettext.c:698: 
> first defined here
> /usr/bin/ld: Warning: size of symbol `_nl_find_msg' changed from 1325 
> in ../intl/libintl.a(dcigettext.o) to 1309 in 
> ../intl/libintl.a(dcigettext.o)

Yeah, welcome to more gettext fragileness.  You probably have some
older version of gettext in /usr/include that is still being used
during the compile, and the header files are conflicting.  

> >build process, and everything else important has been built.  So if
> >you don't care about seeing e2fsck messages being output in Polish or
> >Turkish, you can just ignore the errors.  :-)
> 
> I just ended up compiling with  ./configure --disable-nls

That's probably the best course.

Which distribution were you using?  E2fsprogs compiles just fine on
Red Hat and Debian....

					- Ted




From mvolaski at aecom.yu.edu  Mon Apr 12 05:33:48 2004
From: mvolaski at aecom.yu.edu (Maurice Volaski)
Date: Mon, 12 Apr 2004 01:33:48 -0400
Subject: [Q] Where and what is t-cs.gmo?
In-Reply-To: <20040412024910.GA15438@thunk.org>
References: <20040406160004.24079742BA@hormel.redhat.com>
	<a06100501bc9bf9ba1522@[129.98.90.227]>
	<20040411194752.GA8479@thunk.org>
	<a06100502bc9f51a3609e@[129.98.90.227]>
	<20040412024910.GA15438@thunk.org>
Message-ID: <a06100504bc9fd33ac006@[129.98.90.227]>

>configure checks for xgettext in each of the directories in your PATH.
>Apparently you don't have /usr/local/bin in your PATH.

Of course I do, so something else is going on.

>No, it's not hard-coded.  compile_et is created from compile_et_sh.in,
>which uses awk from the location determined by configure:
>
>#!/bin/sh
>#
>#
>AWK=@AWK@
>
>So if it was /usr/bin/awk, it was because the configure script thought
>it was there.  I'm not sure why that would have been the case, unless
>you had a config.cache file generated from another system.

OK, make clean apparently leaves config.cache alone, so it had to 
have been inserted there when I initially ran it before I had 
upgraded awk, which got placed in /usr/local/bin/awk.

>Yeah, welcome to more gettext fragileness.  You probably have some
>older version of gettext in /usr/include that is still being used
>during the compile, and the header files are conflicting.

I can find it only in /usr/local/include.

>>  >build process, and everything else important has been built.  So if
>>  >you don't care about seeing e2fsck messages being output in Polish or
>>  >Turkish, you can just ignore the errors.  :-)
>>
>>  I just ended up compiling with  ./configure --disable-nls
>
>That's probably the best course.

Perhaps the INSTALL doc itself should mention potential compile 
problems with NLS, and you might even consider having NLS disabled by 
default.

>Which distribution were you using?  E2fsprogs compiles just fine on
>Red Hat and Debian....
>

It is Red Hat 7.1, but heavily modified over the years.
-- 

Maurice Volaski, mvolaski at aecom.yu.edu
Computing Support, Rose F. Kennedy Center
Albert Einstein College of Medicine of Yeshiva University




From crosser at rol.ru  Mon Apr 12 07:20:47 2004
From: crosser at rol.ru (Eugene Crosser)
Date: Mon, 12 Apr 2004 11:20:47 +0400
Subject: [Q] Where and what is t-cs.gmo?
In-Reply-To: <20040412024910.GA15438@thunk.org>
References: <20040406160004.24079742BA@hormel.redhat.com>
	<a06100501bc9bf9ba1522@[129.98.90.227]>
	<20040411194752.GA8479@thunk.org>
	<a06100502bc9f51a3609e@[129.98.90.227]>
	<20040412024910.GA15438@thunk.org>
Message-ID: <1081754447.17109.6.camel@ariel.sovam.com>

On Mon, 2004-04-12 at 06:49, Theodore Ts'o wrote:
> On Sun, Apr 11, 2004 at 04:20:39PM -0400, Maurice Volaski wrote:
> > I didn't seem to have gettext at all, so I installed version 0.14.1. 
> > And ./configure said:
> > 
> > checking for xgettext... (cached) no
> > 
> > despite
> > whereis xgettext
> > xgettext: /usr/local/bin/xgettext
> 
> configure checks for xgettext in each of the directories in your PATH.
> Apparently you don't have /usr/local/bin in your PATH.

No, I think Maurice should just "rm config.cache" and "./configure"
again.
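
That suggestion amounts to something like the sketch below. It assumes it is run from the top of the e2fsprogs source tree; the function name reconfigure is hypothetical, and any arguments are simply passed through to configure.

```shell
# Drop a possibly stale autoconf cache, then re-run configure so every
# "(cached)" answer is probed again on this system.
# Assumption: the current directory is the top of the source tree.
reconfigure() {
    rm -f config.cache && ./configure "$@"
}

# Example: reconfigure --disable-nls
```

With the cache gone, checks such as the ones for xgettext and awk are evaluated against the current PATH instead of being reused from an earlier run.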

Eugene

From kfitzner at excelcia.org  Tue Apr 13 11:05:08 2004
From: kfitzner at excelcia.org (Kurt Fitzner)
Date: Tue, 13 Apr 2004 05:05:08 -0600
Subject: EXT3 on raid with external journal...
Message-ID: <407BC964.4030206@excelcia.org>

I have a raid5 array on my web server for which I am currently 
considering a move to ext3.  I want to use an external journal to 
improve performance.

Since the external journal would reside on a drive that is not 
participating in the raid array, I'm wondering how an ext3 filesystem 
behaves if the device holding its external journal fails.  If it reverts 
to ext2 behavior upon failure, then I can justify using a non-raid 
device for an external journal.




From rkalaskar at aethon.com  Tue Apr 13 18:55:46 2004
From: rkalaskar at aethon.com (Rahul Kalaskar)
Date: Tue, 13 Apr 2004 14:55:46 -0400
Subject: logging disk activity
Message-ID: <1081882546.17960.1.camel@rkalaskar>

Hi all,

I would like to know how often writes happen on an ext3 fs. Is there any
way to find this out?

Thanks
Rahul




From kfitzner at excelcia.org  Tue Apr 13 20:28:46 2004
From: kfitzner at excelcia.org (Kurt Fitzner)
Date: Tue, 13 Apr 2004 14:28:46 -0600
Subject: logging disk activity
In-Reply-To: <1081882546.17960.1.camel@rkalaskar>
References: <1081882546.17960.1.camel@rkalaskar>
Message-ID: <407C4D7E.8000604@excelcia.org>

Rahul Kalaskar wrote:
> I would like to know how often writes happen on an ext3 fs. Is there any
> way to find this out?

Data is flushed to the journal every 5 seconds, as opposed to ext2 where 
it is flushed every 30.




From mb/ext3 at dcs.qmul.ac.uk  Wed Apr 14 09:06:08 2004
From: mb/ext3 at dcs.qmul.ac.uk (Matt Bernstein)
Date: Wed, 14 Apr 2004 10:06:08 +0100 (BST)
Subject: EXT3 on raid with external journal...
In-Reply-To: <407BC964.4030206@excelcia.org>
References: <407BC964.4030206@excelcia.org>
Message-ID: <Pine.LNX.4.58.0404141002560.3742@lucy.dcs.qmul.ac.uk>

On Apr 13 Kurt Fitzner wrote:

>I have a raid5 array on my web server for which I am currently 
>considering a move to ext3.  I want to use an external journal to 
>improve performance.
>
>Since the external journal would reside on a drive that is not 
>participating in the raid array, I'm wondering how an ext3 filesystem 
>behaves if the device holding its external journal fails.  If it reverts 
>to ext2 behavior upon failure, then I can justify using a non-raid 
>device for an external journal.

There could be metadata which is only in the journal, so failure probably 
means reboot + full fsck, so you may as well use ext2 if your machine 
doesn't otherwise crash.

Far preferable, I think, would be to put your journal on a RAID 1 pair.




From mcuss at cdlsystems.com  Wed Apr 14 18:43:05 2004
From: mcuss at cdlsystems.com (Mark Cuss)
Date: Wed, 14 Apr 2004 12:43:05 -0600
Subject: Question about EXT3 error messages in /var/log/messages
Message-ID: <209f01c42250$531e76d0$ab0e10ac@pinchy>

Hello list

I've been having the following error messages pop up in my kernel log:

Apr 12 04:08:09 hal kernel: EXT3-fs error (device md(9,2)): ext3_readdir:
bad entry in directory #2670595: rec_len %% 4 != 0 - offset=0,
inode=827218527, rec_len=20275, name_len=73
Apr 12 04:08:14 hal kernel: EXT3-fs error (device md(9,2)): ext3_readdir:
bad entry in directory #2670596: rec_len %% 4 != 0 - offset=0,
inode=861103477, rec_len=95, name_len=95
Apr 12 04:08:17 hal kernel: EXT3-fs error (device md(9,2)): ext3_readdir:
bad entry in directory #2670597: rec_len %% 4 != 0 - offset=0,
inode=1601531495, rec_len=30819, name_len=120
Apr 12 04:08:20 hal kernel: EXT3-fs error (device md(9,2)): ext3_readdir:
bad entry in directory #2670598: rec_len %% 4 != 0 - offset=0,
inode=1634890872, rec_len=29795, name_len=111
Apr 12 04:08:32 hal kernel: EXT3-fs error (device md(9,2)): ext3_readdir:
bad entry in directory #2670599: rec_len %% 4 != 0 - offset=0,
inode=1951614277, rec_len=12337, name_len=95

I've done some searching and talked to some people on another mailing
list...  I can't seem to figure out which device these errors occur on...
I'd been told that the "device md(9,2)" ID indicates major 9 and minor 2,
and I've also been told that the 9 is the SCSI channel once 8 is subtracted
from it (so, 1 in this case), and that 2 is the SCSI ID of the offending
device in that channel.

So, I'm a little lost here.  I figured the experts here could let me know
how to map the reported ID numbers to a physical disk or RAID device.

Thanks in advance!

Mark

Mark Cuss, B. Sc.
Real Time Systems Analyst
System Administrator
CDL Systems Ltd
Suite 230
3553 - 31 Street NW
Calgary, AB, Canada

Phone: 403 289 1733 ext 226
Fax: 403 282 1238
www.cdlsystems.com





From mbasil at alabanza.com  Wed Apr 14 19:08:37 2004
From: mbasil at alabanza.com (Mark Basil)
Date: Wed, 14 Apr 2004 15:08:37 -0400
Subject: Question about EXT3 error messages in /var/log/messages
In-Reply-To: <209f01c42250$531e76d0$ab0e10ac@pinchy>
References: <209f01c42250$531e76d0$ab0e10ac@pinchy>
Message-ID: <1081969716.29735.105.camel@mbasil.alabanza.com>

Mark C.,

Try

ls -l /dev/ | grep "9,   2"

-Mark B.
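
For reference, the "md(9,2)" in those log lines is a (major, minor)
device number pair, and the ls/grep above matches that pair against the
device nodes in /dev.  A minimal Python sketch of the same lookup,
assuming a Unix system; the helper names parse_kernel_dev and
find_device_nodes are made up for this illustration:

```python
import os
import re
import stat

def parse_kernel_dev(text):
    """Extract (major, minor) from a kernel-style name such as 'md(9,2)'."""
    match = re.search(r"\((\d+),\s*(\d+)\)", text)
    if match is None:
        raise ValueError("no (major,minor) pair in %r" % text)
    return int(match.group(1)), int(match.group(2))

def find_device_nodes(major, minor, dev_dir="/dev"):
    """Return block device nodes directly under dev_dir with this device number."""
    wanted = os.makedev(major, minor)
    hits = []
    for name in sorted(os.listdir(dev_dir)):
        path = os.path.join(dev_dir, name)
        try:
            info = os.lstat(path)
        except OSError:
            continue  # node vanished or is unreadable
        if stat.S_ISBLK(info.st_mode) and info.st_rdev == wanted:
            hits.append(path)
    return hits

if __name__ == "__main__":
    major, minor = parse_kernel_dev("EXT3-fs error (device md(9,2))")
    print(major, minor, find_device_nodes(major, minor))
```

Like the grep, this only looks at the top level of /dev, so nodes kept
in subdirectories would be missed.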





From mcuss at cdlsystems.com  Wed Apr 14 19:10:03 2004
From: mcuss at cdlsystems.com (Mark Cuss)
Date: Wed, 14 Apr 2004 13:10:03 -0600
Subject: Question about EXT3 error messages in /var/log/messages
References: <209f01c42250$531e76d0$ab0e10ac@pinchy>
	<1081969716.29735.105.camel@mbasil.alabanza.com>
Message-ID: <20ac01c42254$17343070$ab0e10ac@pinchy>

Okay - so it is the major and minor numbers - thanks!  That means that md2
is the culprit...

Does this mean that I have a drive failing in this raid or could the
filesystem just need an fsck?

Thanks
Mark






From mbasil at alabanza.com  Wed Apr 14 19:54:32 2004
From: mbasil at alabanza.com (Mark Basil)
Date: Wed, 14 Apr 2004 15:54:32 -0400
Subject: Question about EXT3 error messages in /var/log/messages
In-Reply-To: <20ac01c42254$17343070$ab0e10ac@pinchy>
References: <209f01c42250$531e76d0$ab0e10ac@pinchy>
	<1081969716.29735.105.camel@mbasil.alabanza.com>
	<20ac01c42254$17343070$ab0e10ac@pinchy>
Message-ID: <1081972472.29737.121.camel@mbasil.alabanza.com>

It very well could mean either.  Something caused the corruption, which
could hint at the drive going bad.  I'd say 90% of the time, an fsck
will correct an error like this.

Before the fsck, you might want to do some digging as to which
directories contain those corrupt files if you care to know where the
corruption occurred.

Search your other logs around that timeframe, Apr 12 04:08:09, and see
what processes got kicked off and were reading/writing the filesystem.

Also, I'm not sure exactly how to directly get the directory name for an
inode, but you can list the contents of that directory, and go from
there.

Also, I don't know what happens when you try to load RAID devices
into debugfs as I've never done it, but if you CAN do it, here is how it
would be done:

$ debugfs

debugfs: open /dev/md2
debugfs: ls <2670595>

That should give you the contents of that directory.  Take a filename
from there, and do a find or locate on it if it's not obvious at the
time.

Good luck.

-Mark B.





From gianluca.cecchi at hp.com  Thu Apr 15 06:51:25 2004
From: gianluca.cecchi at hp.com (Cecchi, Gianluca)
Date: Thu, 15 Apr 2004 08:51:25 +0200
Subject: Question about EXT3 error messages in /var/log/messages
Message-ID: <5C3713B4B134BA49AFEAF154D0A1EE1C39A7A1@mlnexc01.emea.cpqcorp.net>



>Also, I'm not sure exactly how to directly get the directory name for an
>inode, but you can list the contents of that directory, and go from
>there.

If the inode table is not corrupted, you can use the -inum switch of the
find command on each file system involved, although it may be time consuming.

gcecchi at pc-gcecchi:~$ ll -lid /home/gcecchi
776220 drwx------  73 gcecchi users 4096 Apr 15 08:37 /home/gcecchi/

gcecchi at pc-gcecchi:/tmp# find /home -inum 776220
/home/gcecchi

Gianluca
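
The same search can be sketched in Python for cases where the result
needs further processing.  This is only an illustration of what
"find -xdev -inum" does; find_by_inode is a hypothetical helper name,
and the walk stays on one filesystem because inode numbers are only
unique within a filesystem:

```python
import os
import tempfile

def find_by_inode(root, inode):
    """Return paths under root (same filesystem only) with the given inode number."""
    root_info = os.lstat(root)
    matches = [root] if root_info.st_ino == inode else []
    for dirpath, dirnames, filenames in os.walk(root):
        keep = []
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # entry vanished or is unreadable
            if info.st_dev != root_info.st_dev:
                continue  # a different filesystem is mounted here
            if info.st_ino == inode:
                matches.append(path)
            if name in dirnames:
                keep.append(name)
        dirnames[:] = keep  # do not descend into other filesystems
    return matches

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        target = os.path.join(top, "sub")
        os.mkdir(target)
        print(find_by_inode(top, os.lstat(target).st_ino))
```

As with find, this has to stat every entry under the starting
directory, so it is slow on large trees.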




From mnoman at hblpk.com  Thu Apr 15 08:13:40 2004
From: mnoman at hblpk.com (Muhammad Noman)
Date: Thu, 15 Apr 2004 13:13:40 +0500
Subject: Conversion from ext2 to ext3
Message-ID: <039101c422c1$98d5d6c0$0600a8c0@noman>

Dear All
Greetings.

I have a question regarding the ext3 file system. I have installed Red Hat Linux 8 with the ext2 file system and I have multiple partitions. Now I want to convert them to the ext3 file system without disturbing my data.

If anyone has an idea, please let me know.

Thanks


Muhammad Noman

From sct at redhat.com  Thu Apr 15 10:55:26 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 15 Apr 2004 11:55:26 +0100
Subject: Conversion from ext2 to ext3
In-Reply-To: <039101c422c1$98d5d6c0$0600a8c0@noman>
References: <039101c422c1$98d5d6c0$0600a8c0@noman>
Message-ID: <1082026526.2100.45.camel@sisko.scot.redhat.com>

Hi,

On Thu, 2004-04-15 at 09:13, Muhammad Noman wrote:
 
> I have a question regarding the ext3 file system. I have installed Red Hat
> Linux 8 with the ext2 file system and I have multiple partitions. Now I
> want to convert them to the ext3 file system without disturbing my data.
 
> If anyone has an idea, please let me know.

"man tune2fs" --- you want the "tune2fs -j" option.

Cheers,
 Stephen
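
A rough sketch of the conversion for several partitions, in Python.
The only part taken from the advice above is "tune2fs -j"; the fstab
step is an additional assumption (the filesystem type in /etc/fstab
also has to say ext3 for the journal to be used at mount time), and
the device names are placeholders:

```python
def tune2fs_command(device):
    """Command line that adds a journal to an existing ext2 filesystem."""
    return ["tune2fs", "-j", device]

def fstab_line_to_ext3(line):
    """Change the type field of one fstab line from ext2 to ext3."""
    fields = line.split()
    # Leave comments, short lines and non-ext2 entries untouched.
    if line.lstrip().startswith("#") or len(fields) < 3 or fields[2] != "ext2":
        return line
    fields[2] = "ext3"
    return "\t".join(fields)

if __name__ == "__main__":
    # Placeholder partitions; substitute the real ones.
    for device in ("/dev/hda2", "/dev/hda3"):
        print(" ".join(tune2fs_command(device)))
    print(fstab_line_to_ext3("/dev/hda2 /home ext2 defaults 1 2"))
```

The sketch only prints the commands; running tune2fs itself needs root
and should be done by hand after checking the device names.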





From manuaroste at yahoo.es  Thu Apr 15 10:58:31 2004
From: manuaroste at yahoo.es (=?iso-8859-1?q?Manuel=20Arostegui=20Ramirez?=)
Date: Thu, 15 Apr 2004 12:58:31 +0200 (CEST)
Subject: Conversion from ext2 to ext3
In-Reply-To: <039101c422c1$98d5d6c0$0600a8c0@noman>
Message-ID: <20040415105831.61328.qmail@web60103.mail.yahoo.com>

 --- Muhammad Noman <mnoman at hblpk.com> wrote:
> Dear All
> Greetings.
> 
> I have a question regarding the ext3 file system. I have
> installed Red Hat Linux 8 with the ext2 file system and
> I have multiple partitions. Now I want to convert
> them to the ext3 file system without disturbing my data.
> 
> If anyone has an idea, please let me know.
> 
> Thanks
Try this:
http://puggy.symonds.net/~rajesh/howto/ext3/toc.html



=====
--

Manuel Aróstegui Linux user 200896
http://manuel.todo-linux.com


		




From sct at redhat.com  Thu Apr 15 11:00:42 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 15 Apr 2004 12:00:42 +0100
Subject: Question about EXT3 error messages in /var/log/messages
In-Reply-To: <20ac01c42254$17343070$ab0e10ac@pinchy>
References: <209f01c42250$531e76d0$ab0e10ac@pinchy>
	<1081969716.29735.105.camel@mbasil.alabanza.com>
	<20ac01c42254$17343070$ab0e10ac@pinchy>
Message-ID: <1082026842.2100.49.camel@sisko.scot.redhat.com>

Hi,

On Wed, 2004-04-14 at 20:10, Mark Cuss wrote:
> Okay - so it is the major and minor numbers - thanks!  That means that md2
> is the culprit...
> 
> Does this mean that I have a drive failing in this raid or could the
> filesystem just need an fsck?

A drive failing should show up as IO errors in the logs, and the md
layer automatically switches out drives which give errors.  So it's not
a drive failing in the usual sense.

You've just got corrupt metadata on disk.  How it got there is pure
speculation --- the disk, controller, memory, CPU or software might be
at fault, and it's impossible to tell at this point.  But a fsck is
definitely recommended, as some types of on-disk corruption can spread,
corrupting other data as time goes on (in particular, if an indirect
block or bitmap gets corrupted then disk blocks belonging to one file
can get overwritten by being reallocated to another file.)  You don't
want to wait for that to happen!

Cheers,
 Stephen
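
To make the logged complaint concrete: the kernel rejected those
directory entries because the record length was not a multiple of 4.
The Python sketch below is modelled on the kind of sanity checks the
ext2/ext3 directory code performs.  Only the "rec_len % 4 != 0" case
comes from the log; the other checks, their wording, and the 4096-byte
block size are assumptions made for illustration:

```python
def dir_rec_len(name_len):
    """Smallest legal record: 8-byte header plus the name, rounded up to 4 bytes."""
    return (name_len + 8 + 3) & ~3

def check_dir_entry(rec_len, name_len, offset=0, block_size=4096):
    """Return a description of the first problem found, or None if the entry looks sane."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > block_size:
        return "directory entry across blocks"
    return None

# (rec_len, name_len) pairs copied from the log messages in this thread.
logged = [(20275, 73), (95, 95), (30819, 120), (29795, 111), (12337, 95)]

if __name__ == "__main__":
    for rec_len, name_len in logged:
        print(rec_len, name_len, check_dir_entry(rec_len, name_len))
```

All five logged entries fail the alignment test, which fits the
picture of directory blocks that have been overwritten with unrelated
data rather than a single flipped bit.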





From sct at redhat.com  Thu Apr 15 11:12:34 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 15 Apr 2004 12:12:34 +0100
Subject: ext3_free_data()
In-Reply-To: <20040401094049.37281.qmail@web60907.mail.yahoo.com>
References: <20040401094049.37281.qmail@web60907.mail.yahoo.com>
Message-ID: <1082027554.2100.54.camel@sisko.scot.redhat.com>

Hi,

On Thu, 2004-04-01 at 10:40, Scorpion Yang wrote:

> why is there a window between  statements "err =
> ext3_journal_get_write_access(handle, this_bh);" and
> "ext3_journal_dirty_metadata(handle, this_bh);" ?

That's the whole point of the journaling mechanism.  To modify a block
of metadata on the ext3 filesystem, you need to get write access first,
then change the buffer, then queue the changes for the journal.  There
*has* to be a window between getting the write access and committing it,
because that's the window where the buffer is allowed to be changed!

The issue is that ext3 tries to do zero-copy writing of metadata to the
journal whenever possible.  So, a given buffer_head at some point in
time might actually be waiting to be written to the journal.  To
maintain journal integrity, we cannot touch that data --- so we need to
do a "get_write_access()" first so that the jbd layer can copy out the
data that's destined for the journal.  The caller can then safely modify
the primary contents of the buffer_head.
 
> In my test, there is an error in journal_dirty_metadata: jh is null

In what context?

Cheers,
 Stephen
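
The ordering described above can be pictured with a toy model in
Python.  This is not the jbd code or its API: Buffer,
get_write_access, dirty_metadata and data_for_commit are simplified
stand-ins, and the single "frozen" copy is an assumption about how the
copy-out could be represented:

```python
class Buffer:
    """A metadata buffer; pending_commit means an older transaction must still write it."""
    def __init__(self, data, pending_commit=False):
        self.data = bytearray(data)
        self.pending_commit = pending_commit
        self.frozen = None          # copy preserved for the committing transaction
        self.write_access = False

def get_write_access(buf):
    """Step 1: let the journal copy out data it still has to write."""
    if buf.pending_commit and buf.frozen is None:
        buf.frozen = bytes(buf.data)
    buf.write_access = True

def dirty_metadata(buf, running_transaction):
    """Step 3: queue the modified buffer for the running transaction."""
    if not buf.write_access:
        raise RuntimeError("buffer modified without get_write_access")
    running_transaction.append(buf)
    buf.write_access = False

def data_for_commit(buf):
    """What the older, committing transaction writes to the journal."""
    return buf.frozen if buf.frozen is not None else bytes(buf.data)

if __name__ == "__main__":
    buf = Buffer(b"old bitmap", pending_commit=True)
    running = []
    get_write_access(buf)           # step 1
    buf.data[:3] = b"new"           # step 2: the window where changes are allowed
    dirty_metadata(buf, running)    # step 3
    print(data_for_commit(buf), bytes(buf.data))
```

In this model the older transaction still commits the old contents
while the caller sees the new ones, which is the reason the window
between the two calls has to exist.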





From sct at redhat.com  Thu Apr 15 11:34:03 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 15 Apr 2004 12:34:03 +0100
Subject: Strange Fedora Booting problem: can not mount "LABEL=*"
	partitions
In-Reply-To: <41089CB27BD8D24E8385C8003EDAF7AB084841@karl.alexa.com>
References: <41089CB27BD8D24E8385C8003EDAF7AB084841@karl.alexa.com>
Message-ID: <1082028843.2100.63.camel@sisko.scot.redhat.com>

Hi,

On Fri, 2004-04-02 at 07:36, Guolin Cheng wrote:

>  FC1 stops on partitions "LABEL=/var" on two machines, stops on
> partition "LABEL=/" on the 3rd machine. 

When it "stops", what error does it show?

> While the default|upgraded NPTL kernel (with SMP problem) boots
> without a glitch, my vanilla 2.4.25 kernel plus libata patch
> 2.4.25-libata8 fails with the symptoms described above.

What happens without the libata patch?

>  The solution to fix it is:  manually run "e2fsck -y -f /dev/hd?;
> tune2fs -j /dev/hd?; e2label /dev/hd? <LABEL>" again even if there is no
> problem with the file system, journal inode and ext2 label, then reboot. 

Very very odd --- that really helps, every time?

--Stephen





From guolin at alexa.com  Thu Apr 15 17:50:03 2004
From: guolin at alexa.com (Guolin Cheng)
Date: Thu, 15 Apr 2004 10:50:03 -0700
Subject: Strange Fedora Booting problem: can not mount
	"LABEL=*"partitions
Message-ID: <41089CB27BD8D24E8385C8003EDAF7AB084856@karl.alexa.com>

Hi, Stephen and Jeff,

 Thanks. But the problem has been debugged and fixed; the answer was posted on
fedora-list about 2 weeks ago. 

The problem is: the /etc/blkid.tab file acts as a stale
disk partition cache for the fsck|blkid commands when a system image is
installed on a machine with a different disk type (scsi->ide). The old cache
misleads fsck|blkid on the first run, and only the first run, since the
first run updates the /etc/blkid.tab file. 

As you know, systemimager and similar methods use a single system image
to clone hundreds or thousands of machines very quickly and
reliably, but this time for Fedora, the /etc/blkid.tab should be
cleared of any existing old disk partition cache when the
source and destination machines have different types of hard disks.

So, another solution, maybe a better solution, is to patch the e2fsprogs
package, so that the blkid* library routines will ignore the cache
contents in /etc/blkid.tab, exactly as the existing "blkid -c
/dev/null" does.  


Thanks.
--Guolin Cheng








From mcuss at cdlsystems.com  Thu Apr 15 19:44:02 2004
From: mcuss at cdlsystems.com (Mark Cuss)
Date: Thu, 15 Apr 2004 13:44:02 -0600
Subject: Question about EXT3 error messages in /var/log/messages
References: <209f01c42250$531e76d0$ab0e10ac@pinchy>
	<1081969716.29735.105.camel@mbasil.alabanza.com>
	<20ac01c42254$17343070$ab0e10ac@pinchy>
	<1082026842.2100.49.camel@sisko.scot.redhat.com>
Message-ID: <232701c42322$00f939d0$ab0e10ac@pinchy>

Thanks Stephen - I'll be sure to boot everyone off the volume, unmount it,
and do an fsck tomorrow morning.

Thanks
Mark






From kfitzner at excelcia.org  Fri Apr 16 14:20:13 2004
From: kfitzner at excelcia.org (Kurt Fitzner)
Date: Fri, 16 Apr 2004 08:20:13 -0600
Subject: EXT3 on raid with external journal...
In-Reply-To: <Pine.LNX.4.58.0404141002560.3742@lucy.dcs.qmul.ac.uk>
References: <407BC964.4030206@excelcia.org>
	<Pine.LNX.4.58.0404141002560.3742@lucy.dcs.qmul.ac.uk>
Message-ID: <407FEB9D.1000002@excelcia.org>

Matt Bernstein wrote:
> On Apr 13 Kurt Fitzner wrote:
> 
> There could be metadata which is only in the journal, so failure probably 
> means reboot + full fsck, so you may as well use ext2 if your machine 
> doesn't otherwise crash.
> 
> Far preferable, I think, would be to put your journal on a RAID 1 pair.

I would like to think that if the ext3 driver encountered an error 
writing to the journal, that it would then skip the journal and write 
straight to the device - reverting to ext2 behavior.  There should never 
be any loss of data (meta or otherwise) upon the failure of a journal 
device.  That is, unless the failure of the journaling device coincides 
with a power failure.  That is:

1) Failure of journaling device
2) Attempted write of metadata to journal device
3) Power failure before ext3 gives up on the journaling device

In that scenario, the ramification is the array requiring a full fsck. 
The benefit of running the journal on an external device would far 
outweigh the cost of a full fsck in the unlikely event the above happens.

I need to know, though, what exactly is the behavior of ext3 in the 
following situations:
  - At system startup if there is a failure to "mount" an external journal
  - During operation if the external journal device fails.

Does ext3 then revert to non-journaled (ext2) behavior in those instances?





From mcuss at cdlsystems.com  Fri Apr 16 14:33:18 2004
From: mcuss at cdlsystems.com (Mark Cuss)
Date: Fri, 16 Apr 2004 08:33:18 -0600
Subject: Question about EXT3 error messages in /var/log/messages
References: <209f01c42250$531e76d0$ab0e10ac@pinchy>
	<1081969716.29735.105.camel@mbasil.alabanza.com>
	<20ac01c42254$17343070$ab0e10ac@pinchy>
	<1082026842.2100.49.camel@sisko.scot.redhat.com>
Message-ID: <244d01c423bf$c34ee5f0$ab0e10ac@pinchy>

Ok - I did the fsck this morning, and there were definitely some problems on
the filesystem (ie - duplicate inodes, wrong reference counts, etc.) but
fsck seemed to clean them up fine...

This filesystem is on two disks configured in a striping RAID...  These
disks were originally in my old server, but I plugged them into my new SCSI
disk array and the system recognized the md device and everything seemed
OK - I guess I should've known to do an fsck before putting the system into
service...

Thanks again
Mark






From stoffel at lucent.com  Fri Apr 16 16:13:24 2004
From: stoffel at lucent.com (John Stoffel)
Date: Fri, 16 Apr 2004 12:13:24 -0400
Subject: online resize of ext3 possible?
Message-ID: <16512.1572.690002.225379@gargle.gargle.HOWL>


Hi folks,

Is it possible to resize an ext3 filesystem while it's online?  It
looks like resize2fs won't do the trick unless the filesystem is
unmounted.  And ext2resize takes one look at the filesystem while it's
mounted and complains as well, this time about un-supported features.

It's not a huge deal if I have to shutdown the system to grow the two
filesystems, it's just more annoying.

I'm running Debian Unstable, with a lot of /unstable and /testing
patches.  It's up to date as of last night too.  

Thanks,
John




From sct at redhat.com  Fri Apr 16 21:19:23 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 16 Apr 2004 22:19:23 +0100
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1081255826.22308.57.camel@ariel.sovam.com>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
Message-ID: <1082150363.2081.85.camel@sisko.scot.redhat.com>

Hi,

On Tue, 2004-04-06 at 13:50, Eugene Crosser wrote:
> More representative statistics for my "quota on ext3" trouble:
> 
> after moving about 10,000 files and setting quota for a million
> groupids, and then several hours of inactivity(!) I zeroed profile
> counters (readprofile -r), ran `time sync' and then `readprofile'.  Here
> are the results.  Yes, that's true, it took 3 (three) hours for `sync'
> to complete!

Turns out there's a nasty O(N^2) behaviour in vfs_quota_sync().  That
function walks the dquot list looking for things to sync, and it drops
the lock when doing the actual syncing --- so each item synced causes it
to start again at the beginning of the list.  If each item starts off
dirty, then the list walk is N^2.

An obvious cure is to shift the start of the list to point just after
the item just synced.  I've done only limited testing of this patch, but
does it help for you?

2.4 and 2.6 seem to share this problem.

Cheers,
 Stephen

-------------- next part --------------
--- linux-2.4/fs/dquot.c.=K0000=.orig
+++ linux-2.4/fs/dquot.c
@@ -397,6 +397,10 @@ restart:
 			wait_on_dquot(dquot);
 		if (dquot_dirty(dquot))
 			sb->dq_op->write_dquot(dquot);
+		/* Move the inuse_list head pointer to just after the
+		 * current dquot, so that we'll restart the list walk
+		 * after this point on the next pass. */
+		list_move(&inuse_list, &dquot->dq_inuse);
 		dqput(dquot);
 		goto restart;
 	}
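
The effect of the patch can be estimated with a small simulation.  The
Python sketch below is a model of the list walk only, not of the
kernel code: walk_cost is a hypothetical helper that counts how many
list entries get visited when every one of n entries starts off dirty,
either restarting from a fixed head each time or restarting just after
the entry that was synced:

```python
def walk_cost(n, move_head):
    """Count list entries visited while syncing n dirty entries on a circular list."""
    dirty = [True] * n
    head = 0
    visited = 0
    while True:
        for step in range(n):
            idx = (head + step) % n
            visited += 1
            if dirty[idx]:
                dirty[idx] = False          # write it out; the lock is dropped here
                if move_head:
                    head = (idx + 1) % n    # restart just after this entry
                break                       # "goto restart"
        else:
            return visited                  # a full pass found nothing dirty

if __name__ == "__main__":
    print(walk_cost(1000, False), walk_cost(1000, True))
```

For 1000 entries the model visits 501500 entries with a fixed head and
2000 with the moving head, so the cost goes from roughly N^2/2 to
roughly 2N.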

From crosser at rol.ru  Sat Apr 17 10:40:52 2004
From: crosser at rol.ru (Eugene Crosser)
Date: Sat, 17 Apr 2004 14:40:52 +0400
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1082150363.2081.85.camel@sisko.scot.redhat.com>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
Message-ID: <1082198452.20346.25.camel@pccross.average.org>

On Sat, 2004-04-17 at 01:19, Stephen C. Tweedie wrote:

> > after moving about 10,000 files and setting quota for a million
> > groupids, and then several hours of inactivity(!) I zeroed profile
> > counters (readprofile -r), ran `time sync' and then `readprofile'.  Here
> > are the results.  Yes, that's true, it took 3 (three) hours for `sync'
> > to complete!
> 
> Turns out there's a nasty O(N^2) behaviour in vfs_quota_sync().  That
> function walks the dquot list looking for things to sync, and it drops
> the lock when doing the actual syncing --- so each item synced causes it
> to start again at the beginning of the list.  If each item starts off
> dirty, then the list walk is N^2.
> 
> An obvious cure is to shift the start of the list to point just after
> the item just synced.  I've done only limited testing of this patch, but
> does it help for you?

Cool!  I've already begun to build a testing environment with oprofile
enabled ;-)  During the weekend, I am out of the office, but I'll
certainly verify your fix on Monday.

> 2.4 and 2.6 seem to share this problem.

Apparently things are worse in 2.6.  I have an impression (I have not
checked it yet) that 2.6.5 still suffers from the same deadlock problem
that was fixed in the 2.4.24 -> 2.4.25 diff.

Unrelated question: is quotacheck necessary after mounting an ext3
filesystem to ensure consistent quota status?  I am building a 200GB HA
NFS server hosting a million or two files that belong to 1/3 million
userids; failover without fsck and quotacheck takes about 30-40 seconds,
which is pretty good.  fsck on this filesystem takes about 7 minutes,
quotacheck about 4 minutes.  So, having to run quotacheck has a
significant impact on availability...

Eugene




From tytso at mit.edu  Mon Apr 19 02:53:49 2004
From: tytso at mit.edu (Theodore Ts'o)
Date: Sun, 18 Apr 2004 22:53:49 -0400
Subject: Strange Fedora Booting problem: can not mount
	"LABEL=*"partitions
In-Reply-To: <41089CB27BD8D24E8385C8003EDAF7AB084856@karl.alexa.com>
References: <41089CB27BD8D24E8385C8003EDAF7AB084856@karl.alexa.com>
Message-ID: <20040419025349.GB323@thunk.org>

On Thu, Apr 15, 2004 at 10:50:03AM -0700, Guolin Cheng wrote:
> Hi, Stephen and Jeff,
> 
>  Thanks. But the problem has been debugged and fixed; the answer was
> posted on fedora-list about 2 weeks ago.
> 
> The problem is: the /etc/blkid.tab file acts as a stale cache of disk
> partitions for the fsck|blkid commands when a system image is
> installed on a machine with a different arch (scsi->ide); the old cache
> will mislead fsck|blkid on the first run, and only the first run, since
> the first run updates the /etc/blkid.tab file.

Huh?  It shouldn't do that.  The blkid library validates the
information before it returns it.  Let me check, just to make sure I'm
not going insane:

# grep usr /etc/blkid.tab
<device DEVNO="0x0303" TIME="1082342948" UUID="afc6b073-ad8b-4440-931d-5558e3618fa9" SEC_TYPE="ext3" TYPE="ext2" LABEL="usr">/dev/hda3</device>
# grep usr /tmp/blkid.tab.broken
<device DEVNO="0x0305" TIME="1082342842" UUID="eaf43bde-8da2-4844-aed7-80729e93bd13" SEC_TYPE="ext3" TYPE="ext2" LABEL="usr">/dev/hda5</device>
# cp /tmp/blkid.tab.broken /etc/blkid.tab
# fsck -VN LABEL=usr
fsck 1.35 (28-Feb-2004)
[/sbin/fsck.ext3 (1) -- /usr] fsck.ext3 /dev/hda3
# e2label /dev/hda3
usr

In the above example, you'll see that /etc/blkid.tab shows that
/dev/hda3 has the label "usr".  In /tmp/blkid.tab.broken, it thinks
that /dev/hda5 has the label "usr".  I then copy /tmp/blkid.tab.broken
to /etc/blkid.tab, and then try to do an fsck test.  You'll see that
it uses /dev/hda3, not /dev/hda5.

This is because the blkid library always uses the information in
/etc/blkid.tab as nothing but a hint --- which it verifies before it
returns the correct value.  Even if the /etc/blkid.tab file is
completely bogus, that should be OK.  The blkid library should be able
to recover from this situation just fine.
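
As a rough illustration of the "cache is only a hint" idea described above,
here is a minimal user-space sketch in C.  It is not code from the blkid
library: the struct names, probe_label() and resolve_label() are all made
up for this example, and the "devices" are just an in-memory table standing
in for what would really be read from disk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: what is really on disk (device name and its label). */
struct dev {
	const char *name;
	const char *label;
};

/* Hypothetical cache entry: "this label was last seen on this device". */
struct hint {
	const char *label;
	const char *name;
};

/* Simulated probe: return the label currently on device `name`, or NULL. */
const char *probe_label(const struct dev *devs, size_t ndevs, const char *name)
{
	size_t i;

	for (i = 0; i < ndevs; i++)
		if (strcmp(devs[i].name, name) == 0)
			return devs[i].label;
	return NULL;
}

/* Resolve a label to a device name, treating the cache purely as a hint. */
const char *resolve_label(const struct dev *devs, size_t ndevs,
			  const struct hint *cache, size_t ncache,
			  const char *label)
{
	size_t i;
	const char *actual;

	assert(label != NULL);

	/* 1. Try the cached hint, but trust it only if the probe agrees. */
	for (i = 0; i < ncache; i++) {
		if (strcmp(cache[i].label, label) != 0)
			continue;
		actual = probe_label(devs, ndevs, cache[i].name);
		if (actual != NULL && strcmp(actual, label) == 0)
			return cache[i].name;
	}

	/* 2. No hint, or a stale one: fall back to probing every device. */
	for (i = 0; i < ndevs; i++)
		if (devs[i].label != NULL && strcmp(devs[i].label, label) == 0)
			return devs[i].name;

	return NULL;
}
```

With a device table in which /dev/hda3 carries the label "usr" and a stale
cache claiming "usr" lives on /dev/hda5, resolve_label() in this sketch
returns "/dev/hda3", mirroring the fsck -VN result shown above.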

Can you give me more information about why you think the blkid library
isn't working correctly?

						- Ted




From crosser at rol.ru  Mon Apr 19 14:37:24 2004
From: crosser at rol.ru (Eugene Crosser)
Date: Mon, 19 Apr 2004 18:37:24 +0400
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
	<20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
Message-ID: <1082385443.17175.183.camel@ariel.sovam.com>

On Mon, 2004-04-19 at 17:38, Jan Kara wrote:

> > > > after moving about 10,000 files and setting quota for a million
> > > > groupids, and then several hours of inactivity(!) I zeroed profile
> > > > counters (readprofile -r), ran `time sync' and then `readprofile'.  Here
> > > > are the results.  Yes, that's true, it took 3 (three) hours for `sync'
> > > > to complete!
> > > 
> > > Turns out there's a nasty O(N^2) behaviour in vfs_quota_sync().  That
> > > function walks the dquot list looking for things to sync, and it drops
> > > the lock when doing the actual syncing --- so each item synced causes it
> > > to start again at the beginning of the list.  If each item starts off
> > > dirty, then the list walk is N^2.
> > > 
> > > An obvious cure is to shift the start of the list to point just after
> > > the item just synced.  I've done only limited testing of this patch, but
> > > does it help for you?
> > 
> > Cool!  I've already begun building a test environment with oprofile
> > enabled ;-)  I am out of the office over the weekend, but I'll
> > certainly verify your fix on Monday.
>   Do you already have results? I'd be interested in them...

My first impression is that it did not help.  But it takes a full day of
copying files around to reproduce that nasty 3hr sync.  So far, after a
couple of hours of activity, sync takes 4+ minutes (at 99.9% CPU use, of
course), which is approximately what it took before the patch.  But I will
only know for sure tomorrow.

> > > 2.4 and 2.6 seem to share this problem.
> > 
> > Apparently things are worse in 2.6.  I have an impression (not checked
> > yet) that 2.6.5 still suffers from the same deadlock problem that was
> > fixed in the 2.4.24 -> 2.4.25 diff.
>   2.6.5 should have the same fixes as 2.4.25 wrt ext3. What deadlock
> do you see? There are some more bugfixes on their way to Linus which fix
> some possible deadlocks but I think they should be hard to trigger.

I only observed it once, on my workstation (2.6.5), where I was setting
up the oprofile environment.  I created 10,000 files belonging to 10,000
uids (with quota set for all of them), and ran 'sync'.  The system
kept working for another 10 or 20 minutes; 'sync' did not finish but *was
not* using any cpu, being in 'D' state.  Then the system hung, and since it
was in X11 I do not have a stack trace or anything.  I have not tried to
reproduce it yet, but I will.

> > Unrelated question: is quotacheck necessary after mounting an ext3
> > filesystem to ensure a consistent state?  I am building a 200 GB HA NFS
> > server hosting a million or two files that belong to 1/3 million userids;
> > failover without fsck and quotacheck takes about 30-40 seconds, which is
> > pretty good.  fsck on this filesystem takes about 7 minutes, quotacheck
> > about 4 minutes.  So having to run quotacheck has a significant impact on
> > availability...
>   You need to run quotacheck only if you didn't correctly unmount the
> filesystem. I've written a journalled quota patch which removes the need
> to run quotacheck after an unclean shutdown.

Of course I am only interested in recovery after an unclean shutdown (this
is an HA server).  I was hoping that quota changes might be logged along
with the rest of the filesystem changes...  I.e. that your "journalled
quota" is already in the mainline kernel.

> It is currently included
> in Andrew Morton's kernels (-mm tree) and may end up in vanilla
> kernels, but that depends on Linus. The quota fix and journalled quota
> patch are attached if you are interested... The patches are against
> 2.6.4 but should apply cleanly to 2.6.5.

Hmm, as DRBD supports 2.6 nowadays, I might give it a try...

Eugene

From jack at ucw.cz  Mon Apr 19 13:29:19 2004
From: jack at ucw.cz (Jan Kara)
Date: Mon, 19 Apr 2004 15:29:19 +0200
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1082150363.2081.85.camel@sisko.scot.redhat.com>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
Message-ID: <20040419132919.GA15541@atrey.karlin.mff.cuni.cz>

  Hi,

> On Tue, 2004-04-06 at 13:50, Eugene Crosser wrote:
> > More representative statistics for my "quota on ext3" trouble:
> > 
> > after moving about 10,000 files and setting quota for a million
> > groupids, and then several hours of inactivity(!) I zeroed profile
> > counters (readprofile -r), ran `time sync' and then `readprofile'.  Here
> > are the results.  Yes, that's true, it took 3 (three) hours for `sync'
> > to complete!
> 
> Turns out there's a nasty O(N^2) behaviour in vfs_quota_sync().  That
> function walks the dquot list looking for things to sync, and it drops
> the lock when doing the actual syncing --- so each item synced causes it
> to start again at the beginning of the list.  If each item starts off
> dirty, then the list walk is N^2.
> 
> An obvious cure is to shift the start of the list to point just after
> the item just synced.  I've done only limited testing of this patch, but
> does it help for you?
> 
> 2.4 and 2.6 seem to share this problem.
  Yes, both 2.4 and 2.6 have this problem. I've just never seen it
reported. Your fix should work, although it relies a bit on there being
no other users of the inuse list that walk it non-atomically...
I'll try to think of something which would not rely on this fact and is
reasonably easy to implement.
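
To make the cost difference concrete, here is a minimal user-space sketch
in C of the two walks discussed above.  It is not the kernel code and not
the patch under discussion: struct item, build_all_dirty(), sync_restart()
and sync_resume() are hypothetical names, "syncing" is reduced to clearing
a flag, and the resuming walk simply assumes that nobody else modifies the
list while the lock would be dropped (exactly the assumption mentioned
above).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dquot on the in-use list. */
struct item {
	int dirty;
	struct item *next;
};

/* Link nodes[0..n-1] into a singly linked list and mark every node dirty. */
struct item *build_all_dirty(struct item *nodes, size_t n)
{
	size_t i;

	assert(nodes != NULL && n > 0);
	for (i = 0; i < n; i++) {
		nodes[i].dirty = 1;
		nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
	}
	return &nodes[0];
}

/*
 * Restarting walk: after each item is "synced" the scan begins again at
 * the head, as happens when the list lock is dropped around the sync.
 * Returns the number of list nodes visited.
 */
unsigned long sync_restart(struct item *head)
{
	unsigned long visited = 0;
	struct item *p;

restart:
	for (p = head; p != NULL; p = p->next) {
		visited++;
		if (p->dirty) {
			p->dirty = 0;	/* stands in for the real sync */
			goto restart;
		}
	}
	return visited;
}

/*
 * Resuming walk: continue just after the item that was synced.  Only
 * valid here because nothing else touches the list in this sketch.
 */
unsigned long sync_resume(struct item *head)
{
	unsigned long visited = 0;
	struct item *p;

	for (p = head; p != NULL; p = p->next) {
		visited++;
		if (p->dirty)
			p->dirty = 0;
	}
	return visited;
}
```

With N items that all start off dirty, the restarting walk in this sketch
visits N(N+1)/2 + N nodes (the last N are the final pass that finds nothing
dirty), while the resuming walk visits N.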

						Thanks for the fix
								Honza




From jack at ucw.cz  Mon Apr 19 13:38:07 2004
From: jack at ucw.cz (Jan Kara)
Date: Mon, 19 Apr 2004 15:38:07 +0200
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1082198452.20346.25.camel@pccross.average.org>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
Message-ID: <20040419133807.GB15541@atrey.karlin.mff.cuni.cz>

  Hello,

> > > after moving about 10,000 files and setting quota for a million
> > > groupids, and then several hours of inactivity(!) I zeroed profile
> > > counters (readprofile -r), ran `time sync' and then `readprofile'.  Here
> > > are the results.  Yes, that's true, it took 3 (three) hours for `sync'
> > > to complete!
> > 
> > Turns out there's a nasty O(N^2) behaviour in vfs_quota_sync().  That
> > function walks the dquot list looking for things to sync, and it drops
> > the lock when doing the actual syncing --- so each item synced causes it
> > to start again at the beginning of the list.  If each item starts off
> > dirty, then the list walk is N^2.
> > 
> > An obvious cure is to shift the start of the list to point just after
> > the item just synced.  I've done only limited testing of this patch, but
> > does it help for you?
> 
> Cool!  I've already begun building a test environment with oprofile
> enabled ;-)  I am out of the office over the weekend, but I'll
> certainly verify your fix on Monday.
  Do you already have results? I'd be interested in them...

> > 2.4 and 2.6 seem to share this problem.
> 
> Apparently things are worse in 2.6.  I have an impression (not checked
> yet) that 2.6.5 still suffers from the same deadlock problem that was
> fixed in the 2.4.24 -> 2.4.25 diff.
  2.6.5 should have the same fixes as 2.4.25 wrt ext3. What deadlock
do you see? There are some more bugfixes on their way to Linus which fix
some possible deadlocks but I think they should be hard to trigger.

> Unrelated question: is quotacheck necessary after mounting an ext3
> filesystem to ensure a consistent state?  I am building a 200 GB HA NFS
> server hosting a million or two files that belong to 1/3 million userids;
> failover without fsck and quotacheck takes about 30-40 seconds, which is
> pretty good.  fsck on this filesystem takes about 7 minutes, quotacheck
> about 4 minutes.  So having to run quotacheck has a significant impact on
> availability...
  You need to run quotacheck only if you didn't correctly unmount the
filesystem. I've written a journalled quota patch which removes the need
to run quotacheck after an unclean shutdown. It is currently included
in Andrew Morton's kernels (-mm tree) and may end up in vanilla
kernels, but that depends on Linus. The quota fix and journalled quota
patch are attached if you are interested... The patches are against
2.6.4 but should apply cleanly to 2.6.5.

								Honza
-------------- next part --------------
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4/fs/dquot.c linux-2.6.4-1-lockfix/fs/dquot.c
--- linux-2.6.4/fs/dquot.c	2004-03-04 09:26:37.000000000 +0100
+++ linux-2.6.4-1-lockfix/fs/dquot.c	2004-03-22 21:51:13.000000000 +0100
@@ -85,12 +85,31 @@
  * and quota formats and also dqstats structure containing statistics about the
  * lists. dq_data_lock protects data from dq_dqb and also mem_dqinfo structures
  * and also guards consistency of dquot->dq_dqb with inode->i_blocks, i_bytes.
- * Note that we don't have to do the locking of i_blocks and i_bytes when the
- * quota is disabled - i_sem should serialize the access. dq_data_lock should
- * be always grabbed before dq_list_lock.
+ * i_blocks and i_bytes updates itself are guarded by i_lock acquired directly
+ * in inode_add_bytes() and inode_sub_bytes().
+ *
+ * The spinlock ordering is hence: dq_data_lock > dq_list_lock > i_lock
  *
  * Note that some things (eg. sb pointer, type, id) doesn't change during
  * the life of the dquot structure and so needn't to be protected by a lock
+ *
+ * Any operation working on dquots via inode pointers must hold dqptr_sem.  If
+ * operation is just reading pointers from inode (or not using them at all) the
+ * read lock is enough. If pointers are altered function must hold write lock.
+ * If operation is holding reference to dquot in other way (e.g. quotactl ops)
+ * it must be guarded by dqonoff_sem.
+ * This locking assures that:
+ *   a) update/access to dquot pointers in inode is serialized
+ *   b) everyone is guarded against invalidate_dquots()
+ *
+ * Each dquot has its dq_lock semaphore. Locked dquots might not be referenced
+ * from inodes (dquot_alloc_space() and such don't check the dq_lock).
+ * Currently dquot is locked only when it is being read to memory on the first
+ * dqget(). Write operations on dquots don't hold dq_lock as they copy data
+ * under dq_data_lock spinlock to internal buffers before writing.
+ *
+ * Lock ordering (including journal_lock) is following:
+ *  dqonoff_sem > journal_lock > dqptr_sem > dquot->dq_lock > dqio_sem
  */
 spinlock_t dq_list_lock = SPIN_LOCK_UNLOCKED;
 spinlock_t dq_data_lock = SPIN_LOCK_UNLOCKED;
@@ -169,23 +188,6 @@
  * mechanism to locate a specific dquot.
  */
 
-/*
- * Note that any operation which operates on dquot data (ie. dq_dqb) must
- * hold dq_data_lock.
- *
- * Any operation working with dquots must hold dqptr_sem. If operation is
- * just reading pointers from inodes than read lock is enough. If pointers
- * are altered function must hold write lock.
- *
- * Locked dquots might not be referenced in inodes. Currently dquot it locked
- * only once in its existence - when it's being read to memory on first dqget()
- * and at that time it can't be referenced from inode. Write operations on
- * dquots don't hold dquot lock as they copy data to internal buffers before
- * writing anyway and copying as well as any data update should be atomic. Also
- * nobody can change used entries in dquot structure as this is done only when
- * quota is destroyed and invalidate_dquots() is called only when dq_count == 0.
- */
-
 static LIST_HEAD(inuse_list);
 static LIST_HEAD(free_dquots);
 static struct list_head dquot_hash[NR_DQHASH];
@@ -286,9 +288,9 @@
 }
 
 /* Invalidate all dquots on the list. Note that this function is called after
- * quota is disabled so no new quota might be created. Because we hold dqptr_sem
- * for writing and pointers were already removed from inodes we actually know that
- * no quota for this sb+type should be held. */
+ * quota is disabled so no new quota might be created. Because we hold
+ * dqonoff_sem and pointers were already removed from inodes we actually know
+ * that no quota for this sb+type should be held. */
 static void invalidate_dquots(struct super_block *sb, int type)
 {
 	struct dquot *dquot;
@@ -302,12 +304,11 @@
 			continue;
 		if (dquot->dq_type != type)
 			continue;
-#ifdef __DQUOT_PARANOIA	
-		/* There should be no users of quota - we hold dqptr_sem for writing */
+#ifdef __DQUOT_PARANOIA
 		if (atomic_read(&dquot->dq_count))
 			BUG();
 #endif
-		/* Quota now have no users and it has been written on last dqput() */
+		/* Quota now has no users and it has been written on last dqput() */
 		remove_dquot_hash(dquot);
 		remove_free_dquot(dquot);
 		remove_inuse(dquot);
@@ -323,7 +324,7 @@
 	struct quota_info *dqopt = sb_dqopt(sb);
 	int cnt;
 
-	down_read(&dqopt->dqptr_sem);
+	down(&dqopt->dqonoff_sem);
 restart:
 	/* At this point any dirty dquot will definitely be written so we can clear
 	   dirty flag from info */
@@ -359,7 +360,7 @@
 	spin_lock(&dq_list_lock);
 	dqstats.syncs++;
 	spin_unlock(&dq_list_lock);
-	up_read(&dqopt->dqptr_sem);
+	up(&dqopt->dqonoff_sem);
 
 	return 0;
 }
@@ -402,7 +403,7 @@
 /*
  * Put reference to dquot
  * NOTE: If you change this function please check whether dqput_blocks() works right...
- * MUST be called with dqptr_sem held
+ * MUST be called with either dqptr_sem or dqonoff_sem held
  */
 static void dqput(struct dquot *dquot)
 {
@@ -467,7 +468,7 @@
 
 /*
  * Get reference to dquot
- * MUST be called with dqptr_sem held
+ * MUST be called with either dqptr_sem or dqonoff_sem held
  */
 static struct dquot *dqget(struct super_block *sb, unsigned int id, int type)
 {
@@ -528,7 +529,7 @@
 	return 0;
 }
 
-/* This routine is guarded by dqptr_sem semaphore */
+/* This routine is guarded by dqonoff_sem semaphore */
 static void add_dquot_ref(struct super_block *sb, int type)
 {
 	struct list_head *p;
@@ -594,7 +595,7 @@
 
 /* Free list of dquots - called from inode.c */
 /* dquots are removed from inodes, no new references can be got so we are the only ones holding reference */
-void put_dquot_list(struct list_head *tofree_head)
+static void put_dquot_list(struct list_head *tofree_head)
 {
 	struct list_head *act_head;
 	struct dquot *dquot;
@@ -609,6 +610,20 @@
 	}
 }
 
+/* Function in inode.c - remove pointers to dquots in icache */
+extern void remove_dquot_ref(struct super_block *, int, struct list_head *);
+
+/* Gather all references from inodes and drop them */
+static void drop_dquot_ref(struct super_block *sb, int type)
+{
+	LIST_HEAD(tofree_head);
+
+	down_write(&sb_dqopt(sb)->dqptr_sem);
+	remove_dquot_ref(sb, type, &tofree_head);
+	up_write(&sb_dqopt(sb)->dqptr_sem);
+	put_dquot_list(&tofree_head);
+}
+
 static inline void dquot_incr_inodes(struct dquot *dquot, unsigned long number)
 {
 	dquot->dq_dqb.dqb_curinodes += number;
@@ -804,6 +819,9 @@
 	unsigned int id = 0;
 	int cnt;
 
+	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	if (IS_NOQUOTA(inode))
+		return;
 	down_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	/* Having dqptr_sem we know NOQUOTA flags can't be altered... */
 	if (IS_NOQUOTA(inode)) {
@@ -832,49 +850,22 @@
 }
 
 /*
- *	Remove references to quota from inode
- *	This function needs dqptr_sem for writing
- */
-static void dquot_drop_iupdate(struct inode *inode, struct dquot **to_drop)
-{
-	int cnt;
-
-	inode->i_flags &= ~S_QUOTA;
-	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
-		to_drop[cnt] = inode->i_dquot[cnt];
-		inode->i_dquot[cnt] = NODQUOT;
-	}
-}
-
-/*
  * 	Release all quotas referenced by inode
+ *	Transaction must be started at an entry
  */
 void dquot_drop(struct inode *inode)
 {
-	struct dquot *to_drop[MAXQUOTAS];
 	int cnt;
 
 	down_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
-	dquot_drop_iupdate(inode, to_drop);
+	inode->i_flags &= ~S_QUOTA;
+	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
+		if (inode->i_dquot[cnt] != NODQUOT) {
+			dqput(inode->i_dquot[cnt]);
+			inode->i_dquot[cnt] = NODQUOT;
+		}
+	}
 	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
-	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
-		if (to_drop[cnt] != NODQUOT)
-			dqput(to_drop[cnt]);
-}
-
-/*
- *	Release all quotas referenced by inode.
- *	This function assumes dqptr_sem for writing
- */
-void dquot_drop_nolock(struct inode *inode)
-{
-	struct dquot *to_drop[MAXQUOTAS];
-	int cnt;
-
-	dquot_drop_iupdate(inode, to_drop);
-	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
-		if (to_drop[cnt] != NODQUOT)
-			dqput(to_drop[cnt]);
 }
 
 /*
@@ -885,11 +876,17 @@
 	int cnt, ret = NO_QUOTA;
 	char warntype[MAXQUOTAS];
 
+	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	if (IS_NOQUOTA(inode)) {
+		inode_add_bytes(inode, number);
+		return QUOTA_OK;
+	}
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
 		warntype[cnt] = NOWARN;
 
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	spin_lock(&dq_data_lock);
+	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode))
 		goto add_bytes;
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
@@ -921,9 +918,13 @@
 	int cnt, ret = NO_QUOTA;
 	char warntype[MAXQUOTAS];
 
+	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	if (IS_NOQUOTA(inode))
+		return QUOTA_OK;
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
 		warntype[cnt] = NOWARN;
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode)) {
 		up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 		return QUOTA_OK;
@@ -956,8 +957,14 @@
 {
 	unsigned int cnt;
 
+	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	if (IS_NOQUOTA(inode)) {
+		inode_sub_bytes(inode, number);
+		return;
+	}
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	spin_lock(&dq_data_lock);
+	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode))
 		goto sub_bytes;
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
@@ -978,7 +985,11 @@
 {
 	unsigned int cnt;
 
+	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	if (IS_NOQUOTA(inode))
+		return;
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode)) {
 		up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 		return;
@@ -1007,14 +1018,20 @@
 	    chgid = (iattr->ia_valid & ATTR_GID) && inode->i_gid != iattr->ia_gid;
 	char warntype[MAXQUOTAS];
 
+	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	if (IS_NOQUOTA(inode))
+		return QUOTA_OK;
 	/* Clear the arrays */
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		transfer_to[cnt] = transfer_from[cnt] = NODQUOT;
 		warntype[cnt] = NOWARN;
 	}
+	down(&sb_dqopt(inode->i_sb)->dqonoff_sem);
 	down_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode)) {	/* File without quota accounting? */
 		up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
+		up(&sb_dqopt(inode->i_sb)->dqonoff_sem);
 		return QUOTA_OK;
 	}
 	/* First build the transfer_to list - here we can block on reading of dquots... */
@@ -1065,6 +1082,7 @@
 	ret = QUOTA_OK;
 warn_put_all:
 	spin_unlock(&dq_data_lock);
+	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	flush_warnings(transfer_to, warntype);
 	
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
@@ -1073,7 +1091,7 @@
 		if (ret == NO_QUOTA && transfer_to[cnt] != NODQUOT)
 			dqput(transfer_to[cnt]);
 	}
-	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	up(&sb_dqopt(inode->i_sb)->dqonoff_sem);
 	return ret;
 }
 
@@ -1121,9 +1139,6 @@
 	}
 }
 
-/* Function in inode.c - remove pointers to dquots in icache */
-extern void remove_dquot_ref(struct super_block *, int);
-
 /*
  * Turn quota off on a device. type == -1 ==> quotaoff for all types (umount)
  */
@@ -1137,7 +1152,6 @@
 
 	/* We need to serialize quota_off() for device */
 	down(&dqopt->dqonoff_sem);
-	down_write(&dqopt->dqptr_sem);
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		if (type != -1 && cnt != type)
 			continue;
@@ -1146,7 +1160,7 @@
 		reset_enable_flags(dqopt, cnt);
 
 		/* Note: these are blocking operations */
-		remove_dquot_ref(sb, cnt);
+		drop_dquot_ref(sb, cnt);
 		invalidate_dquots(sb, cnt);
 		/*
 		 * Now all dquots should be invalidated, all writes done so we should be only
@@ -1168,7 +1182,6 @@
 		dqopt->info[cnt].dqi_bgrace = 0;
 		dqopt->ops[cnt] = NULL;
 	}
-	up_write(&dqopt->dqptr_sem);
 	up(&dqopt->dqonoff_sem);
 out:
 	return 0;
@@ -1180,7 +1193,8 @@
 	struct inode *inode;
 	struct quota_info *dqopt = sb_dqopt(sb);
 	struct quota_format_type *fmt = find_quota_format(format_id);
-	int error;
+	int error, cnt;
+	struct dquot *to_drop[MAXQUOTAS];
 	unsigned int oldflags;
 
 	if (!fmt)
@@ -1202,7 +1216,6 @@
 		goto out_f;
 
 	down(&dqopt->dqonoff_sem);
-	down_write(&dqopt->dqptr_sem);
 	if (sb_has_quota_enabled(sb, type)) {
 		error = -EBUSY;
 		goto out_lock;
@@ -1213,8 +1226,20 @@
 	if (!fmt->qf_ops->check_quota_file(sb, type))
 		goto out_file_init;
 	/* We don't want quota and atime on quota files (deadlocks possible) */
-	dquot_drop_nolock(inode);
+	down_write(&dqopt->dqptr_sem);
 	inode->i_flags |= S_NOQUOTA | S_NOATIME;
+	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
+		to_drop[cnt] = inode->i_dquot[cnt];
+		inode->i_dquot[cnt] = NODQUOT;
+	}
+	inode->i_flags &= ~S_QUOTA;
+	up_write(&dqopt->dqptr_sem);
+	/* We must put dquots outside of dqptr_sem because we may need to
+	 * start transaction for write */
+	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
+		if (to_drop[cnt])
+			dqput(to_drop[cnt]);
+	}
 
 	dqopt->ops[type] = fmt->qf_ops;
 	dqopt->info[type].dqi_format = fmt;
@@ -1225,7 +1250,6 @@
 	}
 	up(&dqopt->dqio_sem);
 	set_enable_flags(dqopt, type);
-	up_write(&dqopt->dqptr_sem);
 
 	add_dquot_ref(sb, type);
 	up(&dqopt->dqonoff_sem);
@@ -1268,14 +1292,14 @@
 {
 	struct dquot *dquot;
 
-	down_read(&sb_dqopt(sb)->dqptr_sem);
+	down(&sb_dqopt(sb)->dqonoff_sem);
 	if (!(dquot = dqget(sb, id, type))) {
-		up_read(&sb_dqopt(sb)->dqptr_sem);
+		up(&sb_dqopt(sb)->dqonoff_sem);
 		return -ESRCH;
 	}
 	do_get_dqblk(dquot, di);
 	dqput(dquot);
-	up_read(&sb_dqopt(sb)->dqptr_sem);
+	up(&sb_dqopt(sb)->dqonoff_sem);
 	return 0;
 }
 
@@ -1337,14 +1361,14 @@
 {
 	struct dquot *dquot;
 
-	down_read(&sb_dqopt(sb)->dqptr_sem);
+	down(&sb_dqopt(sb)->dqonoff_sem);
 	if (!(dquot = dqget(sb, id, type))) {
-		up_read(&sb_dqopt(sb)->dqptr_sem);
+		up(&sb_dqopt(sb)->dqonoff_sem);
 		return -ESRCH;
 	}
 	do_set_dqblk(dquot, di);
 	dqput(dquot);
-	up_read(&sb_dqopt(sb)->dqptr_sem);
+	up(&sb_dqopt(sb)->dqonoff_sem);
 	return 0;
 }
 
@@ -1353,9 +1377,9 @@
 {
 	struct mem_dqinfo *mi;
   
-	down_read(&sb_dqopt(sb)->dqptr_sem);
+	down(&sb_dqopt(sb)->dqonoff_sem);
 	if (!sb_has_quota_enabled(sb, type)) {
-		up_read(&sb_dqopt(sb)->dqptr_sem);
+		up(&sb_dqopt(sb)->dqonoff_sem);
 		return -ESRCH;
 	}
 	mi = sb_dqopt(sb)->info + type;
@@ -1365,7 +1389,7 @@
 	ii->dqi_flags = mi->dqi_flags & DQF_MASK;
 	ii->dqi_valid = IIF_ALL;
 	spin_unlock(&dq_data_lock);
-	up_read(&sb_dqopt(sb)->dqptr_sem);
+	up(&sb_dqopt(sb)->dqonoff_sem);
 	return 0;
 }
 
@@ -1374,9 +1398,9 @@
 {
 	struct mem_dqinfo *mi;
 
-	down_read(&sb_dqopt(sb)->dqptr_sem);
+	down(&sb_dqopt(sb)->dqonoff_sem);
 	if (!sb_has_quota_enabled(sb, type)) {
-		up_read(&sb_dqopt(sb)->dqptr_sem);
+		up(&sb_dqopt(sb)->dqonoff_sem);
 		return -ESRCH;
 	}
 	mi = sb_dqopt(sb)->info + type;
@@ -1389,7 +1413,7 @@
 		mi->dqi_flags = (mi->dqi_flags & ~DQF_MASK) | (ii->dqi_flags & DQF_MASK);
 	mark_info_dirty(mi);
 	spin_unlock(&dq_data_lock);
-	up_read(&sb_dqopt(sb)->dqptr_sem);
+	up(&sb_dqopt(sb)->dqonoff_sem);
 	return 0;
 }
 
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4/fs/ext3/super.c linux-2.6.4-1-lockfix/fs/ext3/super.c
--- linux-2.6.4/fs/ext3/super.c	2004-03-17 09:46:58.000000000 +0100
+++ linux-2.6.4-1-lockfix/fs/ext3/super.c	2004-03-17 10:37:23.000000000 +0100
@@ -1958,6 +1958,18 @@
 #define EXT3_V0_QFMT_BLOCKS 27
 
 static int (*old_write_dquot)(struct dquot *dquot);
+static void (*old_drop_dquot)(struct inode *inode);
+
+static int fmt_to_blocks(int fmt)
+{
+	switch (fmt) {
+		case QFMT_VFS_OLD:
+			return  EXT3_OLD_QFMT_BLOCKS;
+		case QFMT_VFS_V0:
+			return EXT3_V0_QFMT_BLOCKS;
+	}
+	return EXT3_MAX_TRANS_DATA;
+}
 
 static int ext3_write_dquot(struct dquot *dquot)
 {
@@ -1965,20 +1977,11 @@
 	int ret;
 	int err;
 	handle_t *handle;
-	struct quota_info *dqops = sb_dqopt(dquot->dq_sb);
+	struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
 	struct inode *qinode;
 
-	switch (dqops->info[dquot->dq_type].dqi_format->qf_fmt_id) {
-		case QFMT_VFS_OLD:
-			nblocks = EXT3_OLD_QFMT_BLOCKS;
-			break;
-		case QFMT_VFS_V0:
-			nblocks = EXT3_V0_QFMT_BLOCKS;
-			break;
-		default:
-			nblocks = EXT3_MAX_TRANS_DATA;
-	}
-	qinode = dqops->files[dquot->dq_type]->f_dentry->d_inode;
+	nblocks = fmt_to_blocks(dqopt->info[dquot->dq_type].dqi_format->qf_fmt_id);
+	qinode = dqopt->files[dquot->dq_type]->f_dentry->d_inode;
 	handle = ext3_journal_start(qinode, nblocks);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
@@ -1991,6 +1994,28 @@
 out:
 	return ret;
 }
+
+static void ext3_drop_dquot(struct inode *inode)
+{
+	int nblocks, type;
+	struct quota_info *dqopt = sb_dqopt(inode->i_sb);
+	handle_t *handle;
+
+	for (type = 0; type < MAXQUOTAS; type++) {
+		if (sb_has_quota_enabled(inode->i_sb, type))
+			break;
+	}
+	if (type < MAXQUOTAS)
+		nblocks = fmt_to_blocks(dqopt->info[type].dqi_format->qf_fmt_id);
+	else
+		nblocks = 0;	/* No quota => no drop */ 
+	handle = ext3_journal_start(inode, 2*nblocks);
+	if (IS_ERR(handle))
+		return;
+	old_drop_dquot(inode);
+	ext3_journal_stop(handle);
+	return;
+}
 #endif
 
 static struct super_block *ext3_get_sb(struct file_system_type *fs_type,
@@ -2018,7 +2043,9 @@
 #ifdef CONFIG_QUOTA
 	init_dquot_operations(&ext3_qops);
 	old_write_dquot = ext3_qops.write_dquot;
+	old_drop_dquot = ext3_qops.drop;
 	ext3_qops.write_dquot = ext3_write_dquot;
+	ext3_qops.drop = ext3_drop_dquot;
 #endif
         err = register_filesystem(&ext3_fs_type);
 	if (err)
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4/fs/inode.c linux-2.6.4-1-lockfix/fs/inode.c
--- linux-2.6.4/fs/inode.c	2004-03-17 09:46:59.000000000 +0100
+++ linux-2.6.4-1-lockfix/fs/inode.c	2004-03-17 10:37:23.000000000 +0100
@@ -1214,15 +1214,13 @@
  */
 #ifdef CONFIG_QUOTA
 
-/* Functions back in dquot.c */
-void put_dquot_list(struct list_head *);
+/* Function back in dquot.c */
 int remove_inode_dquot_ref(struct inode *, int, struct list_head *);
 
-void remove_dquot_ref(struct super_block *sb, int type)
+void remove_dquot_ref(struct super_block *sb, int type, struct list_head *tofree_head)
 {
 	struct inode *inode;
 	struct list_head *act_head;
-	LIST_HEAD(tofree_head);
 
 	if (!sb->dq_op)
 		return;	/* nothing to do */
@@ -1232,26 +1230,24 @@
 	list_for_each(act_head, &inode_in_use) {
 		inode = list_entry(act_head, struct inode, i_list);
 		if (inode->i_sb == sb && IS_QUOTAINIT(inode))
-			remove_inode_dquot_ref(inode, type, &tofree_head);
+			remove_inode_dquot_ref(inode, type, tofree_head);
 	}
 	list_for_each(act_head, &inode_unused) {
 		inode = list_entry(act_head, struct inode, i_list);
 		if (inode->i_sb == sb && IS_QUOTAINIT(inode))
-			remove_inode_dquot_ref(inode, type, &tofree_head);
+			remove_inode_dquot_ref(inode, type, tofree_head);
 	}
 	list_for_each(act_head, &sb->s_dirty) {
 		inode = list_entry(act_head, struct inode, i_list);
 		if (IS_QUOTAINIT(inode))
-			remove_inode_dquot_ref(inode, type, &tofree_head);
+			remove_inode_dquot_ref(inode, type, tofree_head);
 	}
 	list_for_each(act_head, &sb->s_io) {
 		inode = list_entry(act_head, struct inode, i_list);
 		if (IS_QUOTAINIT(inode))
-			remove_inode_dquot_ref(inode, type, &tofree_head);
+			remove_inode_dquot_ref(inode, type, tofree_head);
 	}
 	spin_unlock(&inode_lock);
-
-	put_dquot_list(&tofree_head);
 }
 
 #endif
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4/fs/Kconfig linux-2.6.4-1-lockfix/fs/Kconfig
--- linux-2.6.4/fs/Kconfig	2004-03-17 09:46:58.000000000 +0100
+++ linux-2.6.4-1-lockfix/fs/Kconfig	2004-03-22 22:12:54.000000000 +0100
@@ -417,7 +417,7 @@
 	tristate "Old quota format support"
 	depends on QUOTA
 	help
-	  This quota format was (is) used by kernels earlier than 2.4.??. If
+	  This quota format was (is) used by kernels earlier than 2.4.22. If
 	  you have quota working and you don't want to convert to new quota
 	  format say Y here.
 
@@ -426,8 +426,8 @@
 	depends on QUOTA
 	help
 	  This quota format allows using quotas with 32-bit UIDs/GIDs. If you
-	  need this functionality say Y here. Note that you will need latest
-	  quota utilities for new quota format with this kernel.
+	  need this functionality say Y here. Note that you will need recent
+	  quota utilities (>= 3.01) for new quota format with this kernel.
 
 config QUOTACTL
 	bool
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4/include/linux/quotaops.h linux-2.6.4-1-lockfix/include/linux/quotaops.h
--- linux-2.6.4/include/linux/quotaops.h	2004-03-04 09:26:40.000000000 +0100
+++ linux-2.6.4-1-lockfix/include/linux/quotaops.h	2004-03-17 10:37:23.000000000 +0100
@@ -64,11 +64,8 @@
 		if (inode->i_sb->dq_op->alloc_space(inode, nr, 1) == NO_QUOTA)
 			return 1;
 	}
-	else {
-		spin_lock(&dq_data_lock);
+	else
 		inode_add_bytes(inode, nr);
-		spin_unlock(&dq_data_lock);
-	}
 	return 0;
 }
 
@@ -87,11 +84,8 @@
 		if (inode->i_sb->dq_op->alloc_space(inode, nr, 0) == NO_QUOTA)
 			return 1;
 	}
-	else {
-		spin_lock(&dq_data_lock);
+	else
 		inode_add_bytes(inode, nr);
-		spin_unlock(&dq_data_lock);
-	}
 	return 0;
 }
 
@@ -117,11 +111,8 @@
 {
 	if (sb_any_quota_enabled(inode->i_sb))
 		inode->i_sb->dq_op->free_space(inode, nr);
-	else {
-		spin_lock(&dq_data_lock);
+	else
 		inode_sub_bytes(inode, nr);
-		spin_unlock(&dq_data_lock);
-	}
 }
 
 static __inline__ void DQUOT_FREE_SPACE(struct inode *inode, qsize_t nr)
-------------- next part --------------
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/dquot.c linux-2.6.4-2-jquota/fs/dquot.c
--- linux-2.6.4-1-lockfix/fs/dquot.c	2004-03-22 21:51:13.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/dquot.c	2004-03-22 21:53:04.000000000 +0100
@@ -1,16 +1,13 @@
 /*
- * Implementation of the diskquota system for the LINUX operating
- * system. QUOTA is implemented using the BSD system call interface as
- * the means of communication with the user level. Currently only the
- * ext2 filesystem has support for disk quotas. Other filesystems may
- * be added in the future. This file contains the generic routines
- * called by the different filesystems on allocation of an inode or
- * block. These routines take care of the administration needed to
- * have a consistent diskquota tracking system. The ideas of both
- * user and group quotas are based on the Melbourne quota system as
- * used on BSD derived systems. The internal implementation is 
- * based on one of the several variants of the LINUX inode-subsystem
- * with added complexity of the diskquota system.
+ * Implementation of the diskquota system for the LINUX operating system. QUOTA
+ * is implemented using the BSD system call interface as the means of
+ * communication with the user level. This file contains the generic routines
+ * called by the different filesystems on allocation of an inode or block.
+ * These routines take care of the administration needed to have a consistent
+ * diskquota tracking system. The ideas of both user and group quotas are based
+ * on the Melbourne quota system as used on BSD derived systems. The internal
+ * implementation is based on one of the several variants of the LINUX
+ * inode-subsystem with added complexity of the diskquota system.
  * 
  * Version: $Id: dquot.c,v 6.3 1996/11/17 18:35:34 mvw Exp mvw $
  * 
@@ -52,6 +49,9 @@
  *		New SMP locking.
  *		Jan Kara, <jack at suse.cz>, 10/2002
  *
+ *		Added journalled quota support
+ *		Jan Kara, <jack at suse.cz>, 2003,2004
+ *
  * (C) Copyright 1994 - 1997 Marco van Wieringen 
  */
 
@@ -104,13 +104,17 @@
  *
  * Each dquot has its dq_lock semaphore. Locked dquots might not be referenced
  * from inodes (dquot_alloc_space() and such don't check the dq_lock).
- * Currently dquot is locked only when it is being read to memory on the first
- * dqget(). Write operations on dquots don't hold dq_lock as they copy data
- * under dq_data_lock spinlock to internal buffers before writing.
+ * Currently dquot is locked only when it is being read to memory (or space for
+ * it is being allocated) on the first dqget() and when it is being released on
+ * the last dqput(). The allocation and release operations are serialized by
+ * the dq_lock and by checking the use count in dquot_release().  Write
+ * operations on dquots don't hold dq_lock as they copy data under dq_data_lock
+ * spinlock to internal buffers before writing.
  *
  * Lock ordering (including journal_lock) is following:
  *  dqonoff_sem > journal_lock > dqptr_sem > dquot->dq_lock > dqio_sem
  */
+
 spinlock_t dq_list_lock = SPIN_LOCK_UNLOCKED;
 spinlock_t dq_data_lock = SPIN_LOCK_UNLOCKED;
 
@@ -256,6 +260,9 @@
 	dqstats.allocated_dquots--;
 	list_del(&dquot->dq_inuse);
 }
+/*
+ * End of list functions needing dq_list_lock
+ */
 
 static void wait_on_dquot(struct dquot *dquot)
 {
@@ -263,34 +270,98 @@
 	up(&dquot->dq_lock);
 }
 
-static int read_dqblk(struct dquot *dquot)
+#define mark_dquot_dirty(dquot) ((dquot)->dq_sb->dq_op->mark_dirty(dquot))
+
+/* No locks needed here as ANY_DQUOT_DIRTY is used just by sync and so the
+ * worst that can happen is that the dquot is not written by a concurrent sync... */
+int dquot_mark_dquot_dirty(struct dquot *dquot)
+{
+	set_bit(DQ_MOD_B, &(dquot)->dq_flags);
+	set_bit(DQF_ANY_DQUOT_DIRTY_B, &(sb_dqopt((dquot)->dq_sb)->
+		info[(dquot)->dq_type].dqi_flags));
+	return 0;
+}
+
+void mark_info_dirty(struct super_block *sb, int type)
 {
-	int ret;
+	set_bit(DQF_INFO_DIRTY_B, &sb_dqopt(sb)->info[type].dqi_flags);
+}
+
+
+/*
+ *	Read dquot from disk and alloc space for it
+ */
+
+int dquot_acquire(struct dquot *dquot)
+{
+	int ret = 0;
 	struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
 
 	down(&dquot->dq_lock);
 	down(&dqopt->dqio_sem);
-	ret = dqopt->ops[dquot->dq_type]->read_dqblk(dquot);
+	if (!test_bit(DQ_READ_B, &dquot->dq_flags))
+		ret = dqopt->ops[dquot->dq_type]->read_dqblk(dquot);
+	if (ret < 0)
+		goto out_iolock;
+	set_bit(DQ_READ_B, &dquot->dq_flags);
+	/* Instantiate dquot if needed */
+	if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags) && !dquot->dq_off) {
+		ret = dqopt->ops[dquot->dq_type]->commit_dqblk(dquot);
+		if (ret < 0)
+			goto out_iolock;
+	}
+	set_bit(DQ_ACTIVE_B, &dquot->dq_flags);
+out_iolock:
 	up(&dqopt->dqio_sem);
 	up(&dquot->dq_lock);
 	return ret;
 }
 
-static int commit_dqblk(struct dquot *dquot)
+/*
+ *	Write dquot to disk
+ */
+int dquot_commit(struct dquot *dquot)
 {
-	int ret;
+	int ret = 0;
 	struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
 
 	down(&dqopt->dqio_sem);
-	ret = dqopt->ops[dquot->dq_type]->commit_dqblk(dquot);
+	clear_bit(DQ_MOD_B, &dquot->dq_flags);
+	/* A dquot can be inactive only if there was an error during read/init
+	 * => we had better not write it */
+	if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
+		ret = dqopt->ops[dquot->dq_type]->commit_dqblk(dquot);
 	up(&dqopt->dqio_sem);
+	if (info_dirty(&dqopt->info[dquot->dq_type]))
+		dquot->dq_sb->dq_op->write_info(dquot->dq_sb, dquot->dq_type);
+	return ret;
+}
+
+/*
+ *	Release dquot
+ */
+int dquot_release(struct dquot *dquot)
+{
+	int ret = 0;
+	struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
+
+	down(&dquot->dq_lock);
+	/* Check whether we are not racing with some other dqget() */
+	if (atomic_read(&dquot->dq_count) > 1)
+		goto out_dqlock;
+	down(&dqopt->dqio_sem);
+	ret = dqopt->ops[dquot->dq_type]->release_dqblk(dquot);
+	clear_bit(DQ_ACTIVE_B, &dquot->dq_flags);
+	up(&dqopt->dqio_sem);
+out_dqlock:
+	up(&dquot->dq_lock);
 	return ret;
 }
 
 /* Invalidate all dquots on the list. Note that this function is called after
- * quota is disabled so no new quota might be created. Because we hold
- * dqonoff_sem and pointers were already removed from inodes we actually know
- * that no quota for this sb+type should be held. */
+ * quota is disabled and pointers from inodes removed so there cannot be new
+ * quota users. Also because we hold dqonoff_sem there can be no quota users
+ * for this sb+type at all. */
 static void invalidate_dquots(struct super_block *sb, int type)
 {
 	struct dquot *dquot;
@@ -317,7 +388,7 @@
 	spin_unlock(&dq_list_lock);
 }
 
-static int vfs_quota_sync(struct super_block *sb, int type)
+int vfs_quota_sync(struct super_block *sb, int type)
 {
 	struct list_head *head;
 	struct dquot *dquot;
@@ -328,9 +399,11 @@
 restart:
 	/* At this point any dirty dquot will definitely be written so we can clear
 	   dirty flag from info */
+	spin_lock(&dq_data_lock);
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
 		if ((cnt == type || type == -1) && sb_has_quota_enabled(sb, cnt))
 			clear_bit(DQF_ANY_DQUOT_DIRTY_B, &dqopt->info[cnt].dqi_flags);
+	spin_unlock(&dq_data_lock);
 	spin_lock(&dq_list_lock);
 	list_for_each(head, &inuse_list) {
 		dquot = list_entry(head, struct dquot, dq_inuse);
@@ -338,10 +411,13 @@
 			continue;
                 if (type != -1 && dquot->dq_type != type)
 			continue;
-		if (!dquot->dq_sb)	/* Invalidated? */
-			continue;
 		if (!dquot_dirty(dquot))
 			continue;
+		/* Dirty and inactive can be only bad dquot... */
+		if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
+			continue;
+		/* Now we have an active dquot to which someone holds a reference, so
+		 * we can safely just increase the use count */
 		atomic_inc(&dquot->dq_count);
 		dqstats.lookups++;
 		spin_unlock(&dq_list_lock);
@@ -352,11 +428,9 @@
 	spin_unlock(&dq_list_lock);
 
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
-		if ((cnt == type || type == -1) && sb_has_quota_enabled(sb, cnt) && info_dirty(&dqopt->info[cnt])) {
-			down(&dqopt->dqio_sem);
-			dqopt->ops[cnt]->write_file_info(sb, cnt);
-			up(&dqopt->dqio_sem);
-		}
+		if ((cnt == type || type == -1) && sb_has_quota_enabled(sb, cnt)
+			&& info_dirty(&dqopt->info[cnt]))
+			sb->dq_op->write_info(sb, cnt);
 	spin_lock(&dq_list_lock);
 	dqstats.syncs++;
 	spin_unlock(&dq_list_lock);
@@ -431,11 +505,20 @@
 		spin_unlock(&dq_list_lock);
 		return;
 	}
-	if (dquot_dirty(dquot)) {
+	/* Need to release dquot? */
+	if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags) && dquot_dirty(dquot)) {
 		spin_unlock(&dq_list_lock);
+		/* Commit dquot before releasing */
 		dquot->dq_sb->dq_op->write_dquot(dquot);
 		goto we_slept;
 	}
+	/* Clear flag in case dquot was inactive (something bad happened) */
+	clear_bit(DQ_MOD_B, &dquot->dq_flags);
+	if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
+		spin_unlock(&dq_list_lock);
+		dquot_release(dquot);
+		goto we_slept;
+	}
 	atomic_dec(&dquot->dq_count);
 #ifdef __DQUOT_PARANOIA
 	/* sanity check */
@@ -494,7 +577,6 @@
 		insert_dquot_hash(dquot);
 		dqstats.lookups++;
 		spin_unlock(&dq_list_lock);
-		read_dqblk(dquot);
 	} else {
 		if (!atomic_read(&dquot->dq_count))
 			remove_free_dquot(dquot);
@@ -502,11 +584,17 @@
 		dqstats.cache_hits++;
 		dqstats.lookups++;
 		spin_unlock(&dq_list_lock);
-		wait_on_dquot(dquot);
 		if (empty)
 			kmem_cache_free(dquot_cachep, empty);
 	}
-
+	/* Wait for dq_lock - after this we know that either dquot_release() is already
+	 * finished or it will be canceled due to dq_count > 1 test */
+	wait_on_dquot(dquot);
+	/* Read the dquot and instantiate it (everything done only if needed) */
+	if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags) && dquot_acquire(dquot) < 0) {
+		dqput(dquot);
+		return NODQUOT;
+	}
 #ifdef __DQUOT_PARANOIA
 	if (!dquot->dq_sb)	/* Has somebody invalidated entry under us? */
 		BUG();
@@ -540,12 +628,10 @@
 		struct file *filp = list_entry(p, struct file, f_list);
 		struct inode *inode = filp->f_dentry->d_inode;
 		if (filp->f_mode & FMODE_WRITE && dqinit_needed(inode, type)) {
-			struct vfsmount *mnt = mntget(filp->f_vfsmnt);
 			struct dentry *dentry = dget(filp->f_dentry);
 			file_list_unlock();
 			sb->dq_op->initialize(inode, type);
 			dput(dentry);
-			mntput(mnt);
 			/* As we may have blocked we had better restart... */
 			goto restart;
 		}
@@ -627,13 +713,11 @@
 static inline void dquot_incr_inodes(struct dquot *dquot, unsigned long number)
 {
 	dquot->dq_dqb.dqb_curinodes += number;
-	mark_dquot_dirty(dquot);
 }
 
 static inline void dquot_incr_space(struct dquot *dquot, qsize_t number)
 {
 	dquot->dq_dqb.dqb_curspace += number;
-	mark_dquot_dirty(dquot);
 }
 
 static inline void dquot_decr_inodes(struct dquot *dquot, unsigned long number)
@@ -645,7 +729,6 @@
 	if (dquot->dq_dqb.dqb_curinodes < dquot->dq_dqb.dqb_isoftlimit)
 		dquot->dq_dqb.dqb_itime = (time_t) 0;
 	clear_bit(DQ_INODES_B, &dquot->dq_flags);
-	mark_dquot_dirty(dquot);
 }
 
 static inline void dquot_decr_space(struct dquot *dquot, qsize_t number)
@@ -657,7 +740,6 @@
 	if (toqb(dquot->dq_dqb.dqb_curspace) < dquot->dq_dqb.dqb_bsoftlimit)
 		dquot->dq_dqb.dqb_btime = (time_t) 0;
 	clear_bit(DQ_BLKS_B, &dquot->dq_flags);
-	mark_dquot_dirty(dquot);
 }
 
 static inline int need_print_warning(struct dquot *dquot)
@@ -810,25 +892,22 @@
 }
 
 /*
- * Externally referenced functions through dquot_operations in inode.
- *
- * Note: this is a blocking operation.
+ *	Initialize quota pointers in inode
+ *	Transaction must be started at entry
  */
-void dquot_initialize(struct inode *inode, int type)
+int dquot_initialize(struct inode *inode, int type)
 {
 	unsigned int id = 0;
-	int cnt;
+	int cnt, ret = 0;
 
-	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	/* First test before acquiring semaphore - solves deadlocks when we
+	 * re-enter the quota code and are already holding the semaphore */
 	if (IS_NOQUOTA(inode))
-		return;
+		return 0;
 	down_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	/* Having dqptr_sem we know NOQUOTA flags can't be altered... */
-	if (IS_NOQUOTA(inode)) {
-		up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
-		return;
-	}
-	/* Build list of quotas to initialize... */
+	if (IS_NOQUOTA(inode))
+		goto out_err;
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		if (type != -1 && cnt != type)
 			continue;
@@ -846,14 +925,16 @@
 				inode->i_flags |= S_QUOTA;
 		}
 	}
+out_err:
 	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	return ret;
 }
 
 /*
  * 	Release all quotas referenced by inode
  *	Transaction must be started at an entry
  */
-void dquot_drop(struct inode *inode)
+int dquot_drop(struct inode *inode)
 {
 	int cnt;
 
@@ -866,9 +947,19 @@
 		}
 	}
 	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	return 0;
 }
 
 /*
+ * Following four functions update i_blocks+i_bytes fields and
+ * quota information (together with appropriate checks)
+ * NOTE: We absolutely rely on the fact that caller dirties
+ * the inode (usually macros in quotaops.h care about this) and
+ * holds a handle for the current transaction so that dquot write and
+ * inode write go into the same transaction.
+ */
+
+/*
  * This operation can block, but only after everything is updated
  */
 int dquot_alloc_space(struct inode *inode, qsize_t number, int warn)
@@ -876,8 +967,10 @@
 	int cnt, ret = NO_QUOTA;
 	char warntype[MAXQUOTAS];
 
-	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	/* First test before acquiring semaphore - solves deadlocks when we
+	 * re-enter the quota code and are already holding the semaphore */
 	if (IS_NOQUOTA(inode)) {
+out_add:
 		inode_add_bytes(inode, number);
 		return QUOTA_OK;
 	}
@@ -885,10 +978,11 @@
 		warntype[cnt] = NOWARN;
 
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	if (IS_NOQUOTA(inode)) {	/* Now we can do reliable test... */
+		up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
+		goto out_add;
+	}
 	spin_lock(&dq_data_lock);
-	/* Now recheck reliably when holding dqptr_sem */
-	if (IS_NOQUOTA(inode))
-		goto add_bytes;
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		if (inode->i_dquot[cnt] == NODQUOT)
 			continue;
@@ -900,11 +994,15 @@
 			continue;
 		dquot_incr_space(inode->i_dquot[cnt], number);
 	}
-add_bytes:
 	inode_add_bytes(inode, number);
 	ret = QUOTA_OK;
 warn_put_all:
 	spin_unlock(&dq_data_lock);
+	if (ret == QUOTA_OK)
+		/* Dirtify all the dquots - this can block when journalling */
+		for (cnt = 0; cnt < MAXQUOTAS; cnt++)
+			if (inode->i_dquot[cnt])
+				mark_dquot_dirty(inode->i_dquot[cnt]);
 	flush_warnings(inode->i_dquot, warntype);
 	up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	return ret;
@@ -918,13 +1016,13 @@
 	int cnt, ret = NO_QUOTA;
 	char warntype[MAXQUOTAS];
 
-	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	/* First test before acquiring semaphore - solves deadlocks when we
+	 * re-enter the quota code and are already holding the semaphore */
 	if (IS_NOQUOTA(inode))
 		return QUOTA_OK;
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
 		warntype[cnt] = NOWARN;
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
-	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode)) {
 		up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 		return QUOTA_OK;
@@ -945,6 +1043,11 @@
 	ret = QUOTA_OK;
 warn_put_all:
 	spin_unlock(&dq_data_lock);
+	if (ret == QUOTA_OK)
+		/* Dirtify all the dquots - this can block when journalling */
+		for (cnt = 0; cnt < MAXQUOTAS; cnt++)
+			if (inode->i_dquot[cnt])
+				mark_dquot_dirty(inode->i_dquot[cnt]);
 	flush_warnings((struct dquot **)inode->i_dquot, warntype);
 	up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	return ret;
@@ -953,46 +1056,55 @@
 /*
  * This is a non-blocking operation.
  */
-void dquot_free_space(struct inode *inode, qsize_t number)
+int dquot_free_space(struct inode *inode, qsize_t number)
 {
 	unsigned int cnt;
 
-	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	/* First test before acquiring semaphore - solves deadlocks when we
+	 * re-enter the quota code and are already holding the semaphore */
 	if (IS_NOQUOTA(inode)) {
+out_sub:
 		inode_sub_bytes(inode, number);
-		return;
+		return QUOTA_OK;
 	}
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
-	spin_lock(&dq_data_lock);
 	/* Now recheck reliably when holding dqptr_sem */
-	if (IS_NOQUOTA(inode))
-		goto sub_bytes;
+	if (IS_NOQUOTA(inode)) {
+		up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
+		goto out_sub;
+	}
+	spin_lock(&dq_data_lock);
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		if (inode->i_dquot[cnt] == NODQUOT)
 			continue;
 		dquot_decr_space(inode->i_dquot[cnt], number);
 	}
-sub_bytes:
 	inode_sub_bytes(inode, number);
 	spin_unlock(&dq_data_lock);
+	/* Dirtify all the dquots - this can block when journalling */
+	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
+		if (inode->i_dquot[cnt])
+			mark_dquot_dirty(inode->i_dquot[cnt]);
 	up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	return QUOTA_OK;
 }
 
 /*
  * This is a non-blocking operation.
  */
-void dquot_free_inode(const struct inode *inode, unsigned long number)
+int dquot_free_inode(const struct inode *inode, unsigned long number)
 {
 	unsigned int cnt;
 
-	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	/* First test before acquiring semaphore - solves deadlocks when we
+	 * re-enter the quota code and are already holding the semaphore */
 	if (IS_NOQUOTA(inode))
-		return;
+		return QUOTA_OK;
 	down_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode)) {
 		up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
-		return;
+		return QUOTA_OK;
 	}
 	spin_lock(&dq_data_lock);
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
@@ -1001,7 +1113,12 @@
 		dquot_decr_inodes(inode->i_dquot[cnt], number);
 	}
 	spin_unlock(&dq_data_lock);
+	/* Dirtify all the dquots - this can block when journalling */
+	for (cnt = 0; cnt < MAXQUOTAS; cnt++)
+		if (inode->i_dquot[cnt])
+			mark_dquot_dirty(inode->i_dquot[cnt]);
 	up_read(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	return QUOTA_OK;
 }
 
 /*
@@ -1018,7 +1135,8 @@
 	    chgid = (iattr->ia_valid & ATTR_GID) && inode->i_gid != iattr->ia_gid;
 	char warntype[MAXQUOTAS];
 
-	/* Solve deadlock when we recurse when holding dqptr_sem... */
+	/* First test before acquiring semaphore - solves deadlocks when we
+	 * re-enter the quota code and are already holding the semaphore */
 	if (IS_NOQUOTA(inode))
 		return QUOTA_OK;
 	/* Clear the arrays */
@@ -1026,15 +1144,15 @@
 		transfer_to[cnt] = transfer_from[cnt] = NODQUOT;
 		warntype[cnt] = NOWARN;
 	}
-	down(&sb_dqopt(inode->i_sb)->dqonoff_sem);
 	down_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
 	/* Now recheck reliably when holding dqptr_sem */
 	if (IS_NOQUOTA(inode)) {	/* File without quota accounting? */
 		up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
-		up(&sb_dqopt(inode->i_sb)->dqonoff_sem);
 		return QUOTA_OK;
 	}
-	/* First build the transfer_to list - here we can block on reading of dquots... */
+	/* First build the transfer_to list - here we can block on
+	 * reading/instantiating of dquots.  We know that the transaction for
+	 * us was already started so we don't violate lock ranking here */
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		switch (cnt) {
 			case USRQUOTA:
@@ -1082,7 +1200,13 @@
 	ret = QUOTA_OK;
 warn_put_all:
 	spin_unlock(&dq_data_lock);
-	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	/* Dirtify all the dquots - this can block when journalling */
+	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
+		if (transfer_from[cnt])
+			mark_dquot_dirty(transfer_from[cnt]);
+		if (transfer_to[cnt])
+			mark_dquot_dirty(transfer_to[cnt]);
+	}
 	flush_warnings(transfer_to, warntype);
 	
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
@@ -1091,7 +1215,21 @@
 		if (ret == NO_QUOTA && transfer_to[cnt] != NODQUOT)
 			dqput(transfer_to[cnt]);
 	}
-	up(&sb_dqopt(inode->i_sb)->dqonoff_sem);
+	up_write(&sb_dqopt(inode->i_sb)->dqptr_sem);
+	return ret;
+}
+
+/*
+ * Write info of quota file to disk
+ */
+int dquot_commit_info(struct super_block *sb, int type)
+{
+	int ret;
+	struct quota_info *dqopt = sb_dqopt(sb);
+
+	down(&dqopt->dqio_sem);
+	ret = dqopt->ops[type]->write_file_info(sb, type);
+	up(&dqopt->dqio_sem);
 	return ret;
 }
 
@@ -1099,22 +1237,18 @@
  * Definitions of diskquota operations.
  */
 struct dquot_operations dquot_operations = {
-	.initialize	= dquot_initialize,		/* mandatory */
-	.drop		= dquot_drop,			/* mandatory */
+	.initialize	= dquot_initialize,
+	.drop		= dquot_drop,
 	.alloc_space	= dquot_alloc_space,
 	.alloc_inode	= dquot_alloc_inode,
 	.free_space	= dquot_free_space,
 	.free_inode	= dquot_free_inode,
 	.transfer	= dquot_transfer,
-	.write_dquot	= commit_dqblk
+	.write_dquot	= dquot_commit,
+	.mark_dirty	= dquot_mark_dquot_dirty,
+	.write_info	= dquot_commit_info
 };
 
-/* Function used by filesystems for initializing the dquot_operations structure */
-void init_dquot_operations(struct dquot_operations *fsdqops)
-{
-	memcpy(fsdqops, &dquot_operations, sizeof(dquot_operations));
-}
-
 static inline void set_enable_flags(struct quota_info *dqopt, int type)
 {
 	switch (type) {
@@ -1166,17 +1300,14 @@
 		 * Now all dquots should be invalidated, all writes done so we should be only
 		 * users of the info. No locks needed.
 		 */
-		if (info_dirty(&dqopt->info[cnt])) {
-			down(&dqopt->dqio_sem);
-			dqopt->ops[cnt]->write_file_info(sb, cnt);
-			up(&dqopt->dqio_sem);
-		}
+		if (info_dirty(&dqopt->info[cnt]))
+			sb->dq_op->write_info(sb, cnt);
 		if (dqopt->ops[cnt]->free_file_info)
 			dqopt->ops[cnt]->free_file_info(sb, cnt);
 		put_quota_format(dqopt->info[cnt].dqi_format);
 
 		fput(dqopt->files[cnt]);
-		dqopt->files[cnt] = (struct file *)NULL;
+		dqopt->files[cnt] = NULL;
 		dqopt->info[cnt].dqi_flags = 0;
 		dqopt->info[cnt].dqi_igrace = 0;
 		dqopt->info[cnt].dqi_bgrace = 0;
@@ -1187,33 +1318,30 @@
 	return 0;
 }
 
-int vfs_quota_on(struct super_block *sb, int type, int format_id, char *path)
+/*
+ *	Turn quotas on on a device
+ */
+
+/* Helper function when we already have file open */
+static int vfs_quota_on_file(struct file *f, int type, int format_id)
 {
-	struct file *f;
+	struct quota_format_type *fmt = find_quota_format(format_id);
 	struct inode *inode;
+	struct super_block *sb = f->f_dentry->d_sb;
 	struct quota_info *dqopt = sb_dqopt(sb);
-	struct quota_format_type *fmt = find_quota_format(format_id);
-	int error, cnt;
 	struct dquot *to_drop[MAXQUOTAS];
+	int error, cnt;
 	unsigned int oldflags;
 
 	if (!fmt)
 		return -ESRCH;
-	f = filp_open(path, O_RDWR, 0600);
-	if (IS_ERR(f)) {
-		error = PTR_ERR(f);
-		goto out_fmt;
-	}
 	error = -EIO;
 	if (!f->f_op || !f->f_op->read || !f->f_op->write)
-		goto out_f;
-	error = security_quota_on(f);
-	if (error)
-		goto out_f;
+		goto out_fmt;
 	inode = f->f_dentry->d_inode;
 	error = -EACCES;
 	if (!S_ISREG(inode->i_mode))
-		goto out_f;
+		goto out_fmt;
 
 	down(&dqopt->dqonoff_sem);
 	if (sb_has_quota_enabled(sb, type)) {
@@ -1235,7 +1363,7 @@
 	inode->i_flags &= ~S_QUOTA;
 	up_write(&dqopt->dqptr_sem);
 	/* We must put dquots outside of dqptr_sem because we may need to
-	 * start transaction for write */
+	 * start transaction for dquot_release() */
 	for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
 		if (to_drop[cnt])
 			dqput(to_drop[cnt]);
@@ -1262,14 +1390,58 @@
 out_lock:
 	up_write(&dqopt->dqptr_sem);
 	up(&dqopt->dqonoff_sem);
-out_f:
-	filp_close(f, NULL);
 out_fmt:
 	put_quota_format(fmt);
 
 	return error; 
 }
 
+/* Actual function called from quotactl() */
+int vfs_quota_on(struct super_block *sb, int type, int format_id, char *path)
+{
+	struct file *f;
+	int error;
+
+	f = filp_open(path, O_RDWR, 0600);
+	if (IS_ERR(f))
+		return PTR_ERR(f);
+	error = security_quota_on(f);
+	if (error)
+		goto out_f;
+	error = vfs_quota_on_file(f, type, format_id);
+	if (!error)
+		return 0;
+out_f:
+	filp_close(f, NULL);
+	return error;
+}
+
+/*
+ * Function used by filesystems when filp_open() would fail (filesystem is
+ * being mounted now). We will use a private file structure. Caller is
+ * responsible that its IO functions won't need vfsmnt structure or
+ * some dentry tricks...
+ */
+int vfs_quota_on_mount(int type, int format_id, struct dentry *dentry)
+{
+	struct file *f;
+	int error;
+
+	dget(dentry);	/* Get a reference for struct file */
+	f = dentry_open(dentry, NULL, O_RDWR);
+	if (IS_ERR(f)) {
+		error = PTR_ERR(f);
+		goto out_dentry;
+	}
+	error = vfs_quota_on_file(f, type, format_id);
+	if (!error)
+		return 0;
+	fput(f);
+out_dentry:
+	dput(dentry);
+	return error;
+}
+
 /* Generic routine for getting common part of quota structure */
 static void do_get_dqblk(struct dquot *dquot, struct if_dqblk *di)
 {
@@ -1353,8 +1525,8 @@
 		clear_bit(DQ_FAKE_B, &dquot->dq_flags);
 	else
 		set_bit(DQ_FAKE_B, &dquot->dq_flags);
-	mark_dquot_dirty(dquot);
 	spin_unlock(&dq_data_lock);
+	mark_dquot_dirty(dquot);
 }
 
 int vfs_set_dqblk(struct super_block *sb, int type, qid_t id, struct if_dqblk *di)
@@ -1411,8 +1583,10 @@
 		mi->dqi_igrace = ii->dqi_igrace;
 	if (ii->dqi_valid & IIF_FLAGS)
 		mi->dqi_flags = (mi->dqi_flags & ~DQF_MASK) | (ii->dqi_flags & DQF_MASK);
-	mark_info_dirty(mi);
 	spin_unlock(&dq_data_lock);
+	mark_info_dirty(sb, type);
+	/* Force write to disk */
+	sb->dq_op->write_info(sb, type);
 	up(&sb_dqopt(sb)->dqonoff_sem);
 	return 0;
 }
@@ -1544,4 +1718,21 @@
 EXPORT_SYMBOL(dqstats);
 EXPORT_SYMBOL(dq_list_lock);
 EXPORT_SYMBOL(dq_data_lock);
-EXPORT_SYMBOL(init_dquot_operations);
+EXPORT_SYMBOL(vfs_quota_on);
+EXPORT_SYMBOL(vfs_quota_on_mount);
+EXPORT_SYMBOL(vfs_quota_off);
+EXPORT_SYMBOL(vfs_quota_sync);
+EXPORT_SYMBOL(vfs_get_dqinfo);
+EXPORT_SYMBOL(vfs_set_dqinfo);
+EXPORT_SYMBOL(vfs_get_dqblk);
+EXPORT_SYMBOL(vfs_set_dqblk);
+EXPORT_SYMBOL(dquot_commit);
+EXPORT_SYMBOL(dquot_commit_info);
+EXPORT_SYMBOL(dquot_mark_dquot_dirty);
+EXPORT_SYMBOL(dquot_initialize);
+EXPORT_SYMBOL(dquot_drop);
+EXPORT_SYMBOL(dquot_alloc_space);
+EXPORT_SYMBOL(dquot_alloc_inode);
+EXPORT_SYMBOL(dquot_free_space);
+EXPORT_SYMBOL(dquot_free_inode);
+EXPORT_SYMBOL(dquot_transfer);
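
The acquire/commit/release life cycle that the fs/dquot.c hunks above
introduce can be modelled in a few lines of userspace C. This is a rough
sketch only: toy_dquot, the TOY_* flags and the counters are hypothetical
stand-ins for struct dquot, DQ_READ_B, DQ_ACTIVE_B and DQ_MOD_B, and all
locking (dq_lock, dqio_sem) is left out. It also omits the
!dquot->dq_off test that decides whether dquot_acquire() has to
instantiate the structure on disk.

```c
#include <assert.h>

#define TOY_READ	0x1	/* contents were read from disk */
#define TOY_ACTIVE	0x2	/* dquot is instantiated and usable */
#define TOY_MOD		0x4	/* in-memory copy is dirty */

struct toy_dquot {
	unsigned int flags;
	int count;	/* use count, like dq_count */
	int reads;	/* how often we "read from disk" */
	int writes;	/* how often we "wrote to disk" */
};

/* Read only once, then mark the dquot active. */
static void toy_acquire(struct toy_dquot *dq)
{
	if (!(dq->flags & TOY_READ))
		dq->reads++;
	dq->flags |= TOY_READ | TOY_ACTIVE;
}

/* Clear the dirty flag; write only if the dquot is active. */
static void toy_commit(struct toy_dquot *dq)
{
	dq->flags &= ~TOY_MOD;
	if (dq->flags & TOY_ACTIVE)
		dq->writes++;
}

/* Returns 1 if released, 0 if skipped because another user showed up. */
static int toy_release(struct toy_dquot *dq)
{
	assert(dq->count >= 1);
	if (dq->count > 1)
		return 0;
	dq->flags &= ~TOY_ACTIVE;
	return 1;
}
```

The use-count test in toy_release() corresponds to the
atomic_read(&dquot->dq_count) > 1 check in dquot_release(): a racing
dqget() has already raised the count, so the release is abandoned and
the dquot stays active.
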
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/ext3/inode.c linux-2.6.4-2-jquota/fs/ext3/inode.c
--- linux-2.6.4-1-lockfix/fs/ext3/inode.c	2004-03-04 09:26:37.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/ext3/inode.c	2004-03-22 21:29:48.000000000 +0100
@@ -2772,9 +2772,28 @@
 
 	if ((ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) ||
 		(ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid)) {
+		handle_t *handle;
+
+		/* (user+group)*(old+new) structure, inode write (sb,
+		 * inode block, ? - but truncate inode update has it) */
+		handle = ext3_journal_start(inode, 4*EXT3_QUOTA_INIT_BLOCKS+3);
+		if (IS_ERR(handle)) {
+			error = PTR_ERR(handle);
+			goto err_out;
+		}
 		error = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0;
-		if (error)
+		if (error) {
+			ext3_journal_stop(handle);
 			return error;
+		}
+		/* Update corresponding info in inode so that everything is in
+		 * one transaction */
+		if (attr->ia_valid & ATTR_UID)
+			inode->i_uid = attr->ia_uid;
+		if (attr->ia_valid & ATTR_GID)
+			inode->i_gid = attr->ia_gid;
+		error = ext3_mark_inode_dirty(handle, inode);
+		ext3_journal_stop(handle);
 	}
 
 	if (S_ISREG(inode->i_mode) &&
@@ -2853,7 +2872,9 @@
 		ret = 2 * (bpp + indirects) + 2;
 
 #ifdef CONFIG_QUOTA
-	ret += 2 * EXT3_SINGLEDATA_TRANS_BLOCKS;
+	/* We know that structure was already allocated during DQUOT_INIT so
+	 * we will be updating only the data blocks + inodes */
+	ret += 2*EXT3_QUOTA_TRANS_BLOCKS;
 #endif
 
 	return ret;
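
The credit estimate in the ext3_setattr() hunk above,
4*EXT3_QUOTA_INIT_BLOCKS+3, follows directly from its comment. Spelled
out as a small helper it might look like the sketch below. The function
name toy_chown_credits is hypothetical, and the per-structure cost is
passed in as a parameter because the value of EXT3_QUOTA_INIT_BLOCKS is
not shown in this part of the patch.

```c
#include <assert.h>

/* Credits for a chown under journalled quota, following the comment in
 * the hunk: (user + group) * (old + new) quota structures, plus 3 blocks
 * for the inode write. */
static int toy_chown_credits(int quota_init_blocks)
{
	const int quota_types = 2;	/* user, group */
	const int sides = 2;		/* old owner, new owner */
	const int inode_write = 3;

	assert(quota_init_blocks >= 0);
	return quota_types * sides * quota_init_blocks + inode_write;
}
```
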
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/ext3/namei.c linux-2.6.4-2-jquota/fs/ext3/namei.c
--- linux-2.6.4-1-lockfix/fs/ext3/namei.c	2004-03-04 09:53:20.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/ext3/namei.c	2004-03-22 21:29:48.000000000 +0100
@@ -1633,7 +1633,8 @@
 	int err;
 
 	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
-					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3);
+					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+					2*EXT3_QUOTA_INIT_BLOCKS);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
@@ -1663,7 +1664,8 @@
 		return -EINVAL;
 
 	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
-			 		EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3);
+			 		EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+					2*EXT3_QUOTA_INIT_BLOCKS);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
@@ -1695,7 +1697,8 @@
 		return -EMLINK;
 
 	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
-					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3);
+					EXT3_INDEX_EXTRA_TRANS_BLOCKS + 3 +
+					2*EXT3_QUOTA_INIT_BLOCKS);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
@@ -1974,6 +1977,9 @@
 	struct ext3_dir_entry_2 * de;
 	handle_t *handle;
 
+	/* Initialize quotas beforehand so that any resulting writes go
+	 * in a separate transaction */
+	DQUOT_INIT(dentry->d_inode);
 	handle = ext3_journal_start(dir, EXT3_DELETE_TRANS_BLOCKS);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
@@ -1987,7 +1993,6 @@
 		handle->h_sync = 1;
 
 	inode = dentry->d_inode;
-	DQUOT_INIT(inode);
 
 	retval = -EIO;
 	if (le32_to_cpu(de->inode) != inode->i_ino)
@@ -2031,6 +2036,9 @@
 	struct ext3_dir_entry_2 * de;
 	handle_t *handle;
 
+	/* Initialize quotas beforehand so that any resulting writes go
+	 * in a separate transaction */
+	DQUOT_INIT(dentry->d_inode);
 	handle = ext3_journal_start(dir, EXT3_DELETE_TRANS_BLOCKS);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
@@ -2044,7 +2052,6 @@
 		goto end_unlink;
 
 	inode = dentry->d_inode;
-	DQUOT_INIT(inode);
 
 	retval = -EIO;
 	if (le32_to_cpu(de->inode) != inode->i_ino)
@@ -2087,7 +2094,8 @@
 		return -ENAMETOOLONG;
 
 	handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS +
-			 		EXT3_INDEX_EXTRA_TRANS_BLOCKS + 5);
+			 		EXT3_INDEX_EXTRA_TRANS_BLOCKS + 5 +
+					2*EXT3_QUOTA_INIT_BLOCKS);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
@@ -2172,6 +2180,10 @@
 
 	old_bh = new_bh = dir_bh = NULL;
 
+	/* Initialize quotas beforehand so that any resulting writes go
+	 * in a separate transaction */
+	if (new_dentry->d_inode)
+		DQUOT_INIT(new_dentry->d_inode);
 	handle = ext3_journal_start(old_dir, 2 * EXT3_DATA_TRANS_BLOCKS +
 			 		EXT3_INDEX_EXTRA_TRANS_BLOCKS + 2);
 	if (IS_ERR(handle))
@@ -2198,8 +2210,6 @@
 		if (!new_inode) {
 			brelse (new_bh);
 			new_bh = NULL;
-		} else {
-			DQUOT_INIT(new_inode);
 		}
 	}
 	if (S_ISDIR(old_inode->i_mode)) {
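
The fs/ext3/namei.c hunks above move DQUOT_INIT() in front of
ext3_journal_start() in the rmdir, unlink and rename paths, so that any
writes caused by quota initialization land in a transaction of their
own. The toy model below sketches why the ordering matters for credit
accounting. It is an illustration under simplifying assumptions, not the
JBD API: toy_handle, toy_start, toy_quota_init and toy_unlink are all
hypothetical names, and real handles are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

struct toy_handle {
	int credits;	/* blocks this transaction may still dirty */
};

static int toy_transactions;	/* counts transactions started */

static void toy_start(struct toy_handle *h, int credits)
{
	assert(credits >= 0);
	h->credits = credits;
	toy_transactions++;
}

/* Quota initialization may need to write. With no handle it runs in a
 * transaction of its own; with a handle it spends that handle's credits.
 * Returns 0 on success, -1 if the handle cannot cover the cost. */
static int toy_quota_init(struct toy_handle *h, int cost)
{
	struct toy_handle own;

	if (!h) {
		toy_start(&own, cost);
		h = &own;
	}
	if (h->credits < cost)
		return -1;
	h->credits -= cost;
	return 0;
}

/* Unlink-like operation: init quota first, then start the sized handle.
 * Returns the credits left for the delete, or -1 on failure. */
static int toy_unlink(int delete_credits, int quota_cost)
{
	struct toy_handle h;

	if (toy_quota_init(NULL, quota_cost) < 0)
		return -1;
	toy_start(&h, delete_credits);
	return h.credits;
}
```

In this model, calling toy_quota_init() inside an already started
handle eats into a budget that was sized for the delete alone, which is
the situation the reordering avoids.
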
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/ext3/super.c linux-2.6.4-2-jquota/fs/ext3/super.c
--- linux-2.6.4-1-lockfix/fs/ext3/super.c	2004-03-17 10:37:23.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/ext3/super.c	2004-03-22 21:29:49.000000000 +0100
@@ -32,6 +32,9 @@
 #include <linux/buffer_head.h>
 #include <linux/vfs.h>
 #include <linux/random.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/quotaops.h>
 #include <asm/uaccess.h>
 #include "xattr.h"
 #include "acl.h"
@@ -504,7 +507,43 @@
 # define ext3_clear_inode NULL
 #endif
 
-static struct dquot_operations ext3_qops;
+#ifdef CONFIG_QUOTA
+
+#define QTYPE2NAME(t) ((t)==USRQUOTA?"user":"group")
+#define QTYPE2MOPT(on, t) ((t)==USRQUOTA?(on##USRJQUOTA):(on##GRPJQUOTA))
+
+static int ext3_dquot_initialize(struct inode *inode, int type);
+static int ext3_dquot_drop(struct inode *inode);
+static int ext3_write_dquot(struct dquot *dquot);
+static int ext3_mark_dquot_dirty(struct dquot *dquot);
+static int ext3_write_info(struct super_block *sb, int type);
+static int ext3_quota_on(struct super_block *sb, int type, int format_id, char *path);
+static int ext3_quota_on_mount(struct super_block *sb, int type);
+static int ext3_quota_off_mount(struct super_block *sb, int type);
+
+static struct dquot_operations ext3_quota_operations = {
+	.initialize	= ext3_dquot_initialize,
+	.drop		= ext3_dquot_drop,
+	.alloc_space	= dquot_alloc_space,
+	.alloc_inode	= dquot_alloc_inode,
+	.free_space	= dquot_free_space,
+	.free_inode	= dquot_free_inode,
+	.transfer	= dquot_transfer,
+	.write_dquot	= ext3_write_dquot,
+	.mark_dirty	= ext3_mark_dquot_dirty,
+	.write_info	= ext3_write_info
+};
+
+static struct quotactl_ops ext3_qctl_operations = {
+	.quota_on	= ext3_quota_on,
+	.quota_off	= vfs_quota_off,
+	.quota_sync	= vfs_quota_sync,
+	.get_info	= vfs_get_dqinfo,
+	.set_info	= vfs_set_dqinfo,
+	.get_dqblk	= vfs_get_dqblk,
+	.set_dqblk	= vfs_set_dqblk
+};
+#endif
 
 static struct super_operations ext3_sops = {
 	.alloc_inode	= ext3_alloc_inode,
@@ -536,6 +575,8 @@
 	Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl, Opt_noload,
 	Opt_commit, Opt_journal_update, Opt_journal_inum,
 	Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
+	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
+	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0,
 	Opt_ignore, Opt_err,
 };
 
@@ -571,6 +612,12 @@
 	{Opt_data_journal, "data=journal"},
 	{Opt_data_ordered, "data=ordered"},
 	{Opt_data_writeback, "data=writeback"},
+	{Opt_offusrjquota, "usrjquota="},
+	{Opt_usrjquota, "usrjquota=%s"},
+	{Opt_offgrpjquota, "grpjquota="},
+	{Opt_grpjquota, "grpjquota=%s"},
+	{Opt_jqfmt_vfsold, "jqfmt=vfsold"},
+	{Opt_jqfmt_vfsv0, "jqfmt=vfsv0"},
 	{Opt_ignore, "grpquota"},
 	{Opt_ignore, "noquota"},
 	{Opt_ignore, "quota"},
@@ -598,13 +645,17 @@
 	return sb_block;
 }
 
-static int parse_options (char * options, struct ext3_sb_info *sbi,
+static int parse_options (char * options, struct super_block *sb,
 			  unsigned long * inum, int is_remount)
 {
+	struct ext3_sb_info *sbi = EXT3_SB(sb);
 	char * p;
 	substring_t args[MAX_OPT_ARGS];
 	int data_opt = 0;
 	int option;
+#ifdef CONFIG_QUOTA
+	int qtype;
+#endif
 
 	if (!options)
 		return 1;
@@ -759,6 +810,76 @@
 				sbi->s_mount_opt |= data_opt;
 			}
 			break;
+#ifdef CONFIG_QUOTA
+		case Opt_usrjquota:
+			qtype = USRQUOTA;
+			goto set_qf_name;
+		case Opt_grpjquota:
+			qtype = GRPQUOTA;
+set_qf_name:
+			if (sb_any_quota_enabled(sb)) {
+				printk(KERN_ERR
+					"EXT3-fs: Cannot change journalled "
+					"quota options when quota turned on.\n");
+				return 0;
+			}
+			if (sbi->s_qf_names[qtype]) {
+				printk(KERN_ERR
+					"EXT3-fs: %s quota file already "
+					"specified.\n", QTYPE2NAME(qtype));
+				return 0;
+			}
+			sbi->s_qf_names[qtype] = match_strdup(&args[0]);
+			if (!sbi->s_qf_names[qtype]) {
+				printk(KERN_ERR
+					"EXT3-fs: not enough memory for "
+					"storing quotafile name.\n");
+				return 0;
+			}
+			if (strchr(sbi->s_qf_names[qtype], '/')) {
+				printk(KERN_ERR
+					"EXT3-fs: quotafile must be on "
+					"filesystem root.\n");
+				kfree(sbi->s_qf_names[qtype]);
+				sbi->s_qf_names[qtype] = NULL;
+				return 0;
+			}
+			break;
+		case Opt_offusrjquota:
+			qtype = USRQUOTA;
+			goto clear_qf_name;
+		case Opt_offgrpjquota:
+			qtype = GRPQUOTA;
+clear_qf_name:
+			if (sb_any_quota_enabled(sb)) {
+				printk(KERN_ERR "EXT3-fs: Cannot change "
+					"journalled quota options when "
+					"quota turned on.\n");
+				return 0;
+			}
+			if (sbi->s_qf_names[qtype]) {
+				kfree(sbi->s_qf_names[qtype]);
+				sbi->s_qf_names[qtype] = NULL;
+			}
+			break;
+		case Opt_jqfmt_vfsold:
+			sbi->s_jquota_fmt = QFMT_VFS_OLD;
+			break;
+		case Opt_jqfmt_vfsv0:
+			sbi->s_jquota_fmt = QFMT_VFS_V0;
+			break;
+#else
+		case Opt_usrjquota:
+		case Opt_grpjquota:
+		case Opt_offusrjquota:
+		case Opt_offgrpjquota:
+		case Opt_jqfmt_vfsold:
+		case Opt_jqfmt_vfsv0:
+			printk(KERN_ERR
+				"EXT3-fs: journalled quota options not "
+				"supported.\n");
+			break;
+#endif
 		case Opt_abort:
 			set_opt(sbi->s_mount_opt, ABORT);
 			break;
@@ -771,6 +892,13 @@
 			return 0;
 		}
 	}
+#ifdef CONFIG_QUOTA
+	if (!sbi->s_jquota_fmt && (sbi->s_qf_names[0] || sbi->s_qf_names[1])) {
+		printk(KERN_ERR
+			"EXT3-fs: journalled quota format not specified.\n");
+		return 0;
+	}
+#endif
 
 	return 1;
 }
@@ -930,6 +1058,9 @@
 {
 	unsigned int s_flags = sb->s_flags;
 	int nr_orphans = 0, nr_truncates = 0;
+#ifdef CONFIG_QUOTA
+	int i;
+#endif
 	if (!es->s_last_orphan) {
 		jbd_debug(4, "no orphan inodes to clean up\n");
 		return;
@@ -949,6 +1080,20 @@
 		       sb->s_id);
 		sb->s_flags &= ~MS_RDONLY;
 	}
+#ifdef CONFIG_QUOTA
+	/* Needed for iput() to work correctly and not trash data */
+	sb->s_flags |= MS_ACTIVE;
+	/* Turn on quotas so that they are updated correctly */
+	for (i = 0; i < MAXQUOTAS; i++) {
+		if (EXT3_SB(sb)->s_qf_names[i]) {
+			int ret = ext3_quota_on_mount(sb, i);
+			if (ret < 0)
+				printk(KERN_ERR
+					"EXT3-fs: Cannot turn on journalled "
+					"quota: error %d\n", ret);
+		}
+	}
+#endif
 
 	while (es->s_last_orphan) {
 		struct inode *inode;
@@ -960,6 +1105,7 @@
 		}
 
 		list_add(&EXT3_I(inode)->i_orphan, &EXT3_SB(sb)->s_orphan);
+		DQUOT_INIT(inode);
 		if (inode->i_nlink) {
 			printk(KERN_DEBUG
 				"%s: truncating inode %ld to %Ld bytes\n",
@@ -987,6 +1133,13 @@
 	if (nr_truncates)
 		printk(KERN_INFO "EXT3-fs: %s: %d truncate%s cleaned up\n",
 		       sb->s_id, PLURAL(nr_truncates));
+#ifdef CONFIG_QUOTA
+	/* Turn quotas off */
+	for (i = 0; i < MAXQUOTAS; i++) {
+		if (sb_dqopt(sb)->files[i])
+			ext3_quota_off_mount(sb, i);
+	}
+#endif
 	sb->s_flags = s_flags; /* Restore MS_RDONLY status */
 }
 
@@ -1117,7 +1270,7 @@
 	sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
 	sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
 
-	if (!parse_options ((char *) data, sbi, &journal_inum, 0))
+	if (!parse_options ((char *) data, sb, &journal_inum, 0))
 		goto failed_mount;
 
 	sb->s_flags |= MS_ONE_SECOND;
@@ -1296,7 +1449,10 @@
 	 */
 	sb->s_op = &ext3_sops;
 	sb->s_export_op = &ext3_export_ops;
-	sb->dq_op = &ext3_qops;
+#ifdef CONFIG_QUOTA
+	sb->s_qcop = &ext3_qctl_operations;
+	sb->dq_op = &ext3_quota_operations;
+#endif
 	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
 
 	sb->s_root = 0;
@@ -1406,6 +1562,12 @@
 		brelse(sbi->s_group_desc[i]);
 	kfree(sbi->s_group_desc);
 failed_mount:
+#ifdef CONFIG_QUOTA
+	for (i = 0; i < MAXQUOTAS; i++) {
+		if (sbi->s_qf_names[i])
+			kfree(sbi->s_qf_names[i]);
+	}
+#endif
 	ext3_blkdev_remove(sbi);
 	brelse(bh);
 out_fail:
@@ -1832,7 +1994,7 @@
 	/*
 	 * Allow the "check" option to be passed as a remount option.
 	 */
-	if (!parse_options(data, sbi, &tmp, 1))
+	if (!parse_options(data, sb, &tmp, 1))
 		return -EINVAL;
 
 	if (sbi->s_mount_opt & EXT3_MOUNT_ABORT)
@@ -1952,70 +2114,152 @@
 
 #ifdef CONFIG_QUOTA
 
-/* Blocks: (2 data blocks) * (3 indirect + 1 descriptor + 1 bitmap) + superblock */
-#define EXT3_OLD_QFMT_BLOCKS 11
-/* Blocks: quota info + (4 pointer blocks + 1 entry block) * (3 indirect + 1 descriptor + 1 bitmap) + superblock */
-#define EXT3_V0_QFMT_BLOCKS 27
-
-static int (*old_write_dquot)(struct dquot *dquot);
-static void (*old_drop_dquot)(struct inode *inode);
-
-static int fmt_to_blocks(int fmt)
-{
-	switch (fmt) {
-		case QFMT_VFS_OLD:
-			return  EXT3_OLD_QFMT_BLOCKS;
-		case QFMT_VFS_V0:
-			return EXT3_V0_QFMT_BLOCKS;
-	}
-	return EXT3_MAX_TRANS_DATA;
+static inline struct inode *dquot_to_inode(struct dquot *dquot)
+{
+	return sb_dqopt(dquot->dq_sb)->files[dquot->dq_type]->f_dentry->d_inode;
 }
 
-static int ext3_write_dquot(struct dquot *dquot)
+static int ext3_dquot_initialize(struct inode *inode, int type)
 {
-	int nblocks;
-	int ret;
-	int err;
 	handle_t *handle;
-	struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
-	struct inode *qinode;
+	int ret, err;
 
-	nblocks = fmt_to_blocks(dqopt->info[dquot->dq_type].dqi_format->qf_fmt_id);
-	qinode = dqopt->files[dquot->dq_type]->f_dentry->d_inode;
-	handle = ext3_journal_start(qinode, nblocks);
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto out;
-	}
-	ret = old_write_dquot(dquot);
+	/* We may create quota structure so we need to reserve enough blocks */
+	handle = ext3_journal_start(inode, 2*EXT3_QUOTA_INIT_BLOCKS);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+	ret = dquot_initialize(inode, type);
 	err = ext3_journal_stop(handle);
-	if (ret == 0)
+	if (!ret)
 		ret = err;
-out:
 	return ret;
 }
 
-static void ext3_drop_dquot(struct inode *inode)
+static int ext3_dquot_drop(struct inode *inode)
 {
-	int nblocks, type;
-	struct quota_info *dqopt = sb_dqopt(inode->i_sb);
 	handle_t *handle;
+	int ret, err;
 
-	for (type = 0; type < MAXQUOTAS; type++) {
-		if (sb_has_quota_enabled(inode->i_sb, type))
-			break;
-	}
-	if (type < MAXQUOTAS)
-		nblocks = fmt_to_blocks(dqopt->info[type].dqi_format->qf_fmt_id);
+	/* We may delete quota structure so we need to reserve enough blocks */
+	handle = ext3_journal_start(inode, 2*EXT3_QUOTA_INIT_BLOCKS);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+	ret = dquot_drop(inode);
+	err = ext3_journal_stop(handle);
+	if (!ret)
+		ret = err;
+	return ret;
+}
+
+static int ext3_write_dquot(struct dquot *dquot)
+{
+	int ret, err;
+	handle_t *handle;
+
+	handle = ext3_journal_start(dquot_to_inode(dquot),
+					EXT3_QUOTA_TRANS_BLOCKS);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+	ret = dquot_commit(dquot);
+	err = ext3_journal_stop(handle);
+	if (!ret)
+		ret = err;
+	return ret;
+}
+
+static int ext3_mark_dquot_dirty(struct dquot * dquot)
+{
+	/* Are we journalling quotas? */
+	if (EXT3_SB(dquot->dq_sb)->s_qf_names[0] ||
+	    EXT3_SB(dquot->dq_sb)->s_qf_names[1])
+		return ext3_write_dquot(dquot);
 	else
-		nblocks = 0;	/* No quota => no drop */ 
-	handle = ext3_journal_start(inode, 2*nblocks);
+		return dquot_mark_dquot_dirty(dquot);
+}
+
+static int ext3_write_info(struct super_block *sb, int type)
+{
+	int ret, err;
+	handle_t *handle;
+
+	/* Data block + inode block */
+	handle = ext3_journal_start(sb->s_root->d_inode, 2);
 	if (IS_ERR(handle))
-		return;
-	old_drop_dquot(inode);
-	ext3_journal_stop(handle);
-	return;
+		return PTR_ERR(handle);
+	ret = dquot_commit_info(sb, type);
+	err = ext3_journal_stop(handle);
+	if (!ret)
+		ret = err;
+	return ret;
+}
+
+/*
+ * Turn on quotas during mount time - we need to find
+ * the quota file and such...
+ */
+static int ext3_quota_on_mount(struct super_block *sb, int type)
+{
+	int err;
+	struct dentry *dentry;
+	struct qstr name = { .name = EXT3_SB(sb)->s_qf_names[type],
+			     .hash = 0,
+			     .len = strlen(EXT3_SB(sb)->s_qf_names[type])};
+
+	dentry = lookup_hash(&name, sb->s_root);
+	if (IS_ERR(dentry))
+		return PTR_ERR(dentry);
+	err = vfs_quota_on_mount(type, EXT3_SB(sb)->s_jquota_fmt, dentry);
+	if (err)
+		dput(dentry);
+	/* We keep the dentry reference if everything went ok - we drop it
+	 * at quota_off time */
+	return err;
+}
+
+/* Turn quotas off during mount time */
+static int ext3_quota_off_mount(struct super_block *sb, int type)
+{
+	int err;
+	struct dentry *dentry;
+
+	dentry = sb_dqopt(sb)->files[type]->f_dentry;
+	err = vfs_quota_off_mount(sb, type);
+	/* We invalidate dentry - it has at least wrong hash... */
+	d_invalidate(dentry);
+	dput(dentry);
+	return err;
+}
+
+/*
+ * Standard function to be called on quota_on
+ */
+static int ext3_quota_on(struct super_block *sb, int type, int format_id,
+			 char *path)
+{
+	int err;
+	struct nameidata nd;
+
+	/* Not journalling quota? */
+	if (!EXT3_SB(sb)->s_qf_names[0] && !EXT3_SB(sb)->s_qf_names[1])
+		return vfs_quota_on(sb, type, format_id, path);
+	err = path_lookup(path, LOOKUP_FOLLOW, &nd);
+	if (err)
+		return err;
+	/* Quotafile not on the same filesystem? */
+	if (nd.mnt->mnt_sb != sb)
+		return -EXDEV;
+	/* Quotafile not in fs root? */
+	if (nd.dentry->d_parent->d_inode != sb->s_root->d_inode)
+		printk(KERN_WARNING
+			"EXT3-fs: Quota file not on filesystem root. "
+			"Journalled quota will not work.\n");
+	if (!ext3_should_journal_data(nd.dentry->d_inode))
+		printk(KERN_WARNING "EXT3-fs: Quota file does not have "
+			"data-journalling. Journalled quota will not work.\n");
+	path_release(&nd);
+	return vfs_quota_on(sb, type, format_id, path);
 }
+
 #endif
 
 static struct super_block *ext3_get_sb(struct file_system_type *fs_type,
@@ -2040,13 +2284,6 @@
 	err = init_inodecache();
 	if (err)
 		goto out1;
-#ifdef CONFIG_QUOTA
-	init_dquot_operations(&ext3_qops);
-	old_write_dquot = ext3_qops.write_dquot;
-	old_drop_dquot = ext3_qops.drop;
-	ext3_qops.write_dquot = ext3_write_dquot;
-	ext3_qops.drop = ext3_drop_dquot;
-#endif
         err = register_filesystem(&ext3_fs_type);
 	if (err)
 		goto out;
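
Note on the fs/ext3/super.c changes above: the new ext3_dquot_initialize(), ext3_dquot_drop(), ext3_write_dquot() and ext3_write_info() wrappers all share one shape: start a journal handle, run the generic quota operation, stop the handle, and return the operation's error if there was one, otherwise the error from stopping the handle. The following is only a user-space sketch of that error-combining step. `fake_handle`, `fake_start`, `fake_stop` and `run_in_handle` are hypothetical stand-ins invented for illustration; they are not kernel interfaces, and the -ENOMEM returned on a failed start is an arbitrary choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle; not the kernel's handle_t. */
struct fake_handle {
	int stop_result;	/* value that stopping this handle will report */
};

/* Hypothetical: returns NULL on failure (the kernel uses ERR_PTR values). */
static struct fake_handle *fake_start(struct fake_handle *h, int fail)
{
	return fail ? NULL : h;
}

static int fake_stop(struct fake_handle *h)
{
	return h->stop_result;
}

/*
 * Run op() inside a handle. The operation's own error takes priority;
 * the error from stopping the handle is reported only if op() succeeded.
 */
static int run_in_handle(struct fake_handle *h, int start_fails,
			 int (*op)(void))
{
	struct fake_handle *handle = fake_start(h, start_fails);
	int ret, err;

	if (!handle)
		return -ENOMEM;	/* arbitrary error code for the sketch */
	ret = op();
	err = fake_stop(handle);
	if (!ret)
		ret = err;
	return ret;
}

/* Two trivial operations to exercise both outcomes. */
static int op_ok(void)   { return 0; }
static int op_fail(void) { return -EIO; }
```

With these stubs, a failing operation is reported as -EIO even when stopping the handle also fails, while a successful operation surfaces whatever the stop step returned.
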
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/Kconfig linux-2.6.4-2-jquota/fs/Kconfig
--- linux-2.6.4-1-lockfix/fs/Kconfig	2004-03-22 22:12:54.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/Kconfig	2004-03-23 00:18:16.000000000 +0100
@@ -406,12 +406,15 @@
 	help
 	  If you say Y here, you will be able to set per user limits for disk
 	  usage (also called disk quotas). Currently, it works for the
-	  ext2, ext3, and reiserfs file system. You need additional software
-	  in order to use quota support (you can download sources from
+	  ext2, ext3, and reiserfs file systems. ext3 also supports journalled
+	  quotas for which you don't need to run quotacheck(8) after an unclean
+	  shutdown. You need additional software in order to use quota support
+	  (you can download sources from
 	  <http://www.sf.net/projects/linuxquota/>). For further details, read
 	  the Quota mini-HOWTO, available from
-	  <http://www.tldp.org/docs.html#howto>. Probably the quota
-	  support is only useful for multi user systems. If unsure, say N.
+	  <http://www.tldp.org/docs.html#howto>, or the documentation provided
+	  with the quota tools. Probably the quota support is only useful for
+	  multi user systems. If unsure, say N.
 
 config QFMT_V1
 	tristate "Old quota format support"
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/quota_v1.c linux-2.6.4-2-jquota/fs/quota_v1.c
--- linux-2.6.4-1-lockfix/fs/quota_v1.c	2003-11-26 21:44:57.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/quota_v1.c	2004-03-22 21:29:49.000000000 +0100
@@ -60,7 +60,7 @@
 	v1_disk2mem_dqblk(&dquot->dq_dqb, &dqblk);
 	if (dquot->dq_dqb.dqb_bhardlimit == 0 && dquot->dq_dqb.dqb_bsoftlimit == 0 &&
 	    dquot->dq_dqb.dqb_ihardlimit == 0 && dquot->dq_dqb.dqb_isoftlimit == 0)
-		dquot->dq_flags |= DQ_FAKE;
+		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 	dqstats.reads++;
 
 	return 0;
@@ -80,12 +80,7 @@
 	fs = get_fs();
 	set_fs(KERNEL_DS);
 
-	/*
-	 * Note: clear the DQ_MOD flag unconditionally,
-	 * so we don't loop forever on failure.
-	 */
 	v1_mem2disk_dqblk(&dqblk, &dquot->dq_dqb);
-	dquot->dq_flags &= ~DQ_MOD;
 	if (dquot->dq_id == 0) {
 		dqblk.dqb_btime = sb_dqopt(dquot->dq_sb)->info[type].dqi_bgrace;
 		dqblk.dqb_itime = sb_dqopt(dquot->dq_sb)->info[type].dqi_igrace;
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/quota_v2.c linux-2.6.4-2-jquota/fs/quota_v2.c
--- linux-2.6.4-1-lockfix/fs/quota_v2.c	2003-11-26 21:45:20.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/quota_v2.c	2004-03-22 21:29:49.000000000 +0100
@@ -65,7 +65,7 @@
 	set_fs(fs);
 	if (size != sizeof(struct v2_disk_dqinfo)) {
 		printk(KERN_WARNING "Can't read info structure on device %s.\n",
-			f->f_vfsmnt->mnt_sb->s_id);
+			f->f_dentry->d_sb->s_id);
 		return -1;
 	}
 	info->dqi_bgrace = le32_to_cpu(dinfo.dqi_bgrace);
@@ -87,10 +87,12 @@
 	ssize_t size;
 	loff_t offset = V2_DQINFOOFF;
 
+	spin_lock(&dq_data_lock);
 	info->dqi_flags &= ~DQF_INFO_DIRTY;
 	dinfo.dqi_bgrace = cpu_to_le32(info->dqi_bgrace);
 	dinfo.dqi_igrace = cpu_to_le32(info->dqi_igrace);
 	dinfo.dqi_flags = cpu_to_le32(info->dqi_flags & DQF_MASK);
+	spin_unlock(&dq_data_lock);
 	dinfo.dqi_blocks = cpu_to_le32(info->u.v2_i.dqi_blocks);
 	dinfo.dqi_free_blk = cpu_to_le32(info->u.v2_i.dqi_free_blk);
 	dinfo.dqi_free_entry = cpu_to_le32(info->u.v2_i.dqi_free_entry);
@@ -100,7 +102,7 @@
 	set_fs(fs);
 	if (size != sizeof(struct v2_disk_dqinfo)) {
 		printk(KERN_WARNING "Can't write info structure on device %s.\n",
-			f->f_vfsmnt->mnt_sb->s_id);
+			f->f_dentry->d_sb->s_id);
 		return -1;
 	}
 	return 0;
@@ -173,9 +175,10 @@
 }
 
 /* Remove empty block from list and return it */
-static int get_free_dqblk(struct file *filp, struct mem_dqinfo *info)
+static int get_free_dqblk(struct file *filp, int type)
 {
 	dqbuf_t buf = getdqbuf();
+	struct mem_dqinfo *info = sb_dqinfo(filp->f_dentry->d_sb, type);
 	struct v2_disk_dqdbheader *dh = (struct v2_disk_dqdbheader *)buf;
 	int ret, blk;
 
@@ -193,7 +196,7 @@
 			goto out_buf;
 		blk = info->u.v2_i.dqi_blocks++;
 	}
-	mark_info_dirty(info);
+	mark_info_dirty(filp->f_dentry->d_sb, type);
 	ret = blk;
 out_buf:
 	freedqbuf(buf);
@@ -201,8 +204,9 @@
 }
 
 /* Insert empty block to the list */
-static int put_free_dqblk(struct file *filp, struct mem_dqinfo *info, dqbuf_t buf, uint blk)
+static int put_free_dqblk(struct file *filp, int type, dqbuf_t buf, uint blk)
 {
+	struct mem_dqinfo *info = sb_dqinfo(filp->f_dentry->d_sb, type);
 	struct v2_disk_dqdbheader *dh = (struct v2_disk_dqdbheader *)buf;
 	int err;
 
@@ -210,16 +214,17 @@
 	dh->dqdh_prev_free = cpu_to_le32(0);
 	dh->dqdh_entries = cpu_to_le16(0);
 	info->u.v2_i.dqi_free_blk = blk;
-	mark_info_dirty(info);
+	mark_info_dirty(filp->f_dentry->d_sb, type);
 	if ((err = write_blk(filp, blk, buf)) < 0)	/* Some strange block. We had better leave it... */
 		return err;
 	return 0;
 }
 
 /* Remove given block from the list of blocks with free entries */
-static int remove_free_dqentry(struct file *filp, struct mem_dqinfo *info, dqbuf_t buf, uint blk)
+static int remove_free_dqentry(struct file *filp, int type, dqbuf_t buf, uint blk)
 {
 	dqbuf_t tmpbuf = getdqbuf();
+	struct mem_dqinfo *info = sb_dqinfo(filp->f_dentry->d_sb, type);
 	struct v2_disk_dqdbheader *dh = (struct v2_disk_dqdbheader *)buf;
 	uint nextblk = le32_to_cpu(dh->dqdh_next_free), prevblk = le32_to_cpu(dh->dqdh_prev_free);
 	int err;
@@ -242,7 +247,7 @@
 	}
 	else {
 		info->u.v2_i.dqi_free_entry = nextblk;
-		mark_info_dirty(info);
+		mark_info_dirty(filp->f_dentry->d_sb, type);
 	}
 	freedqbuf(tmpbuf);
 	dh->dqdh_next_free = dh->dqdh_prev_free = cpu_to_le32(0);
@@ -255,9 +260,10 @@
 }
 
 /* Insert given block to the beginning of list with free entries */
-static int insert_free_dqentry(struct file *filp, struct mem_dqinfo *info, dqbuf_t buf, uint blk)
+static int insert_free_dqentry(struct file *filp, int type, dqbuf_t buf, uint blk)
 {
 	dqbuf_t tmpbuf = getdqbuf();
+	struct mem_dqinfo *info = sb_dqinfo(filp->f_dentry->d_sb, type);
 	struct v2_disk_dqdbheader *dh = (struct v2_disk_dqdbheader *)buf;
 	int err;
 
@@ -276,7 +282,7 @@
 	}
 	freedqbuf(tmpbuf);
 	info->u.v2_i.dqi_free_entry = blk;
-	mark_info_dirty(info);
+	mark_info_dirty(filp->f_dentry->d_sb, type);
 	return 0;
 out_buf:
 	freedqbuf(tmpbuf);
@@ -307,7 +313,7 @@
 			goto out_buf;
 	}
 	else {
-		blk = get_free_dqblk(filp, info);
+		blk = get_free_dqblk(filp, dquot->dq_type);
 		if ((int)blk < 0) {
 			*err = blk;
 			freedqbuf(buf);
@@ -315,10 +321,10 @@
 		}
 		memset(buf, 0, V2_DQBLKSIZE);
 		info->u.v2_i.dqi_free_entry = blk;	/* This is enough as block is already zeroed and entry list is empty... */
-		mark_info_dirty(info);
+		mark_info_dirty(dquot->dq_sb, dquot->dq_type);
 	}
 	if (le16_to_cpu(dh->dqdh_entries)+1 >= V2_DQSTRINBLK)	/* Block will be full? */
-		if ((*err = remove_free_dqentry(filp, info, buf, blk)) < 0) {
+		if ((*err = remove_free_dqentry(filp, dquot->dq_type, buf, blk)) < 0) {
 			printk(KERN_ERR "VFS: find_free_dqentry(): Can't remove block (%u) from entry free list.\n", blk);
 			goto out_buf;
 		}
@@ -349,7 +355,6 @@
 static int do_insert_tree(struct dquot *dquot, uint *treeblk, int depth)
 {
 	struct file *filp = sb_dqopt(dquot->dq_sb)->files[dquot->dq_type];
-	struct mem_dqinfo *info = sb_dqopt(dquot->dq_sb)->info + dquot->dq_type;
 	dqbuf_t buf;
 	int ret = 0, newson = 0, newact = 0;
 	u32 *ref;
@@ -358,7 +363,7 @@
 	if (!(buf = getdqbuf()))
 		return -ENOMEM;
 	if (!*treeblk) {
-		ret = get_free_dqblk(filp, info);
+		ret = get_free_dqblk(filp, dquot->dq_type);
 		if (ret < 0)
 			goto out_buf;
 		*treeblk = ret;
@@ -392,7 +397,7 @@
 		ret = write_blk(filp, *treeblk, buf);
 	}
 	else if (newact && ret < 0)
-		put_free_dqblk(filp, info, buf, *treeblk);
+		put_free_dqblk(filp, dquot->dq_type, buf, *treeblk);
 out_buf:
 	freedqbuf(buf);
 	return ret;
@@ -417,6 +422,7 @@
 	ssize_t ret;
 	struct v2_disk_dqblk ddquot;
 
+	/* dq_off is guarded by dqio_sem */
 	if (!dquot->dq_off)
 		if ((ret = dq_insert_tree(dquot)) < 0) {
 			printk(KERN_ERR "VFS: Error %Zd occurred while creating quota.\n", ret);
@@ -424,7 +430,9 @@
 		}
 	filp = sb_dqopt(dquot->dq_sb)->files[type];
 	offset = dquot->dq_off;
+	spin_lock(&dq_data_lock);
 	mem2diskdqb(&ddquot, &dquot->dq_dqb, dquot->dq_id);
+	spin_unlock(&dq_data_lock);
 	fs = get_fs();
 	set_fs(KERNEL_DS);
 	ret = filp->f_op->write(filp, (char *)&ddquot, sizeof(struct v2_disk_dqblk), &offset);
@@ -445,7 +453,6 @@
 static int free_dqentry(struct dquot *dquot, uint blk)
 {
 	struct file *filp = sb_dqopt(dquot->dq_sb)->files[dquot->dq_type];
-	struct mem_dqinfo *info = sb_dqopt(dquot->dq_sb)->info + dquot->dq_type;
 	struct v2_disk_dqdbheader *dh;
 	dqbuf_t buf = getdqbuf();
 	int ret = 0;
@@ -463,8 +470,8 @@
 	dh = (struct v2_disk_dqdbheader *)buf;
 	dh->dqdh_entries = cpu_to_le16(le16_to_cpu(dh->dqdh_entries)-1);
 	if (!le16_to_cpu(dh->dqdh_entries)) {	/* Block got free? */
-		if ((ret = remove_free_dqentry(filp, info, buf, blk)) < 0 ||
-		    (ret = put_free_dqblk(filp, info, buf, blk)) < 0) {
+		if ((ret = remove_free_dqentry(filp, dquot->dq_type, buf, blk)) < 0 ||
+		    (ret = put_free_dqblk(filp, dquot->dq_type, buf, blk)) < 0) {
 			printk(KERN_ERR "VFS: Can't move quota data block (%u) to free list.\n", blk);
 			goto out_buf;
 		}
@@ -473,7 +480,7 @@
 		memset(buf+(dquot->dq_off & ((1 << V2_DQBLKSIZE_BITS)-1)), 0, sizeof(struct v2_disk_dqblk));
 		if (le16_to_cpu(dh->dqdh_entries) == V2_DQSTRINBLK-1) {
 			/* Insert will write block itself */
-			if ((ret = insert_free_dqentry(filp, info, buf, blk)) < 0) {
+			if ((ret = insert_free_dqentry(filp, dquot->dq_type, buf, blk)) < 0) {
 				printk(KERN_ERR "VFS: Can't insert quota data block (%u) to free entry list.\n", blk);
 				goto out_buf;
 			}
@@ -494,7 +501,6 @@
 static int remove_tree(struct dquot *dquot, uint *blk, int depth)
 {
 	struct file *filp = sb_dqopt(dquot->dq_sb)->files[dquot->dq_type];
-	struct mem_dqinfo *info = sb_dqopt(dquot->dq_sb)->info + dquot->dq_type;
 	dqbuf_t buf = getdqbuf();
 	int ret = 0;
 	uint newblk;
@@ -518,7 +524,7 @@
 		ref[GETIDINDEX(dquot->dq_id, depth)] = cpu_to_le32(0);
 		for (i = 0; i < V2_DQBLKSIZE && !buf[i]; i++);	/* Block got empty? */
 		if (i == V2_DQBLKSIZE) {
-			put_free_dqblk(filp, info, buf, *blk);
+			put_free_dqblk(filp, dquot->dq_type, buf, *blk);
 			*blk = 0;
 		}
 		else
@@ -632,7 +638,7 @@
 		if (offset < 0)
 			printk(KERN_ERR "VFS: Can't read quota structure for id %u.\n", dquot->dq_id);
 		dquot->dq_off = 0;
-		dquot->dq_flags |= DQ_FAKE;
+		set_bit(DQ_FAKE_B, &dquot->dq_flags);
 		memset(&dquot->dq_dqb, 0, sizeof(struct mem_dqblk));
 		ret = offset;
 	}
@@ -650,21 +656,24 @@
 			ret = 0;
 		set_fs(fs);
 		disk2memdqb(&dquot->dq_dqb, &ddquot);
+		if (!dquot->dq_dqb.dqb_bhardlimit &&
+			!dquot->dq_dqb.dqb_bsoftlimit &&
+			!dquot->dq_dqb.dqb_ihardlimit &&
+			!dquot->dq_dqb.dqb_isoftlimit)
+			set_bit(DQ_FAKE_B, &dquot->dq_flags);
 	}
 	dqstats.reads++;
 
 	return ret;
 }
 
-/* Commit changes of dquot to disk - it might also mean deleting it when quota became fake one and user has no blocks... */
-static int v2_commit_dquot(struct dquot *dquot)
+/* Check whether dquot should be deleted. We know we are
+ * the only one operating on dquot (thanks to dq_lock) */
+static int v2_release_dquot(struct dquot *dquot)
 {
-	/* We clear the flag everytime so we don't loop when there was an IO error... */
-	dquot->dq_flags &= ~DQ_MOD;
-	if (dquot->dq_flags & DQ_FAKE && !(dquot->dq_dqb.dqb_curinodes | dquot->dq_dqb.dqb_curspace))
+	if (test_bit(DQ_FAKE_B, &dquot->dq_flags) && !(dquot->dq_dqb.dqb_curinodes | dquot->dq_dqb.dqb_curspace))
 		return v2_delete_dquot(dquot);
-	else
-		return v2_write_dquot(dquot);
+	return 0;
 }
 
 static struct quota_format_ops v2_format_ops = {
@@ -673,7 +682,8 @@
 	.write_file_info	= v2_write_file_info,
 	.free_file_info		= NULL,
 	.read_dqblk		= v2_read_dquot,
-	.commit_dqblk		= v2_commit_dquot,
+	.commit_dqblk		= v2_write_dquot,
+	.release_dqblk		= v2_release_dquot,
 };
 
 static struct quota_format_type v2_quota_format = {
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/fs/stat.c linux-2.6.4-2-jquota/fs/stat.c
--- linux-2.6.4-1-lockfix/fs/stat.c	2004-03-17 09:46:59.000000000 +0100
+++ linux-2.6.4-2-jquota/fs/stat.c	2004-03-22 21:29:49.000000000 +0100
@@ -397,6 +397,8 @@
 
 void inode_set_bytes(struct inode *inode, loff_t bytes)
 {
+	/* The caller is responsible for sufficient locking
+	 * (i.e. inode->i_lock) */
 	inode->i_blocks = bytes >> 9;
 	inode->i_bytes = bytes & 511;
 }
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/include/linux/ext3_fs_sb.h linux-2.6.4-2-jquota/include/linux/ext3_fs_sb.h
--- linux-2.6.4-1-lockfix/include/linux/ext3_fs_sb.h	2004-03-04 09:26:40.000000000 +0100
+++ linux-2.6.4-2-jquota/include/linux/ext3_fs_sb.h	2004-03-22 21:29:49.000000000 +0100
@@ -69,6 +69,10 @@
 	struct timer_list turn_ro_timer;	/* For turning read-only (crash simulation) */
 	wait_queue_head_t ro_wait_queue;	/* For people waiting for the fs to go read-only */
 #endif
+#ifdef CONFIG_QUOTA
+	char *s_qf_names[MAXQUOTAS];		/* Names of quota files with journalled quota */
+	int s_jquota_fmt;			/* Format of quota to use */
+#endif
 };
 
 #endif	/* _LINUX_EXT3_FS_SB */
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/include/linux/ext3_jbd.h linux-2.6.4-2-jquota/include/linux/ext3_jbd.h
--- linux-2.6.4-1-lockfix/include/linux/ext3_jbd.h	2004-03-04 09:26:40.000000000 +0100
+++ linux-2.6.4-2-jquota/include/linux/ext3_jbd.h	2004-03-22 21:29:49.000000000 +0100
@@ -42,8 +42,9 @@
  * superblock only gets updated once, of course, so don't bother
  * counting that again for the quota updates. */
 
-#define EXT3_DATA_TRANS_BLOCKS		(3 * EXT3_SINGLEDATA_TRANS_BLOCKS + \
-					 EXT3_XATTR_TRANS_BLOCKS - 2)
+#define EXT3_DATA_TRANS_BLOCKS		(EXT3_SINGLEDATA_TRANS_BLOCKS + \
+					 EXT3_XATTR_TRANS_BLOCKS - 2 + \
+					 2*EXT3_QUOTA_TRANS_BLOCKS)
 
 extern int ext3_writepage_trans_blocks(struct inode *inode);
 
@@ -72,6 +73,19 @@
 
 #define EXT3_INDEX_EXTRA_TRANS_BLOCKS	8
 
+#ifdef CONFIG_QUOTA
+/* Number of blocks needed for quota update - we know that the structure was
+ * allocated so we need to update only inode+data */
+#define EXT3_QUOTA_TRANS_BLOCKS 2
+/* Number of blocks needed for quota insert/delete - we do some block writes
+ * but inode, sb and group updates are done only once */
+#define EXT3_QUOTA_INIT_BLOCKS (DQUOT_MAX_WRITES*\
+				(EXT3_SINGLEDATA_TRANS_BLOCKS-3)+3)
+#else
+#define EXT3_QUOTA_TRANS_BLOCKS 0
+#define EXT3_QUOTA_INIT_BLOCKS 0
+#endif
+
 int
 ext3_mark_iloc_dirty(handle_t *handle, 
 		     struct inode *inode,
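
Note on the include/linux/ext3_jbd.h changes above: the credit count for a quota insert or delete is DQUOT_MAX_WRITES*(EXT3_SINGLEDATA_TRANS_BLOCKS-3)+3, and ext3_dquot_initialize() and ext3_dquot_drop() reserve twice that. The sketch below just works the arithmetic through in user space. The value 8 for EXT3_SINGLEDATA_TRANS_BLOCKS is an assumption, since this patch does not show it; DQUOT_MAX_WRITES is 6, as defined in the include/linux/quota.h hunk of this patch. The helper names are hypothetical, and reading the factor 2 as "one per quota type (user and group)" is a guess rather than something the patch states.

```c
#include <assert.h>

/* Assumed value: not shown in this patch, taken as 8 for illustration. */
#define SINGLEDATA_TRANS_BLOCKS	8
/* From the quota.h hunk: info block + 4 pointer blocks + data block. */
#define MAX_WRITES		6

/*
 * Same formula as EXT3_QUOTA_INIT_BLOCKS with CONFIG_QUOTA set: each write
 * costs (singledata - 3) blocks, and the 3 shared blocks (inode, superblock,
 * group descriptor) are counted once.
 */
static int quota_init_blocks(int singledata, int max_writes)
{
	return max_writes * (singledata - 3) + 3;
}

/* Credits reserved as 2*EXT3_QUOTA_INIT_BLOCKS in the quota wrappers. */
static int dquot_init_credits(void)
{
	return 2 * quota_init_blocks(SINGLEDATA_TRANS_BLOCKS, MAX_WRITES);
}
```

Under these assumptions one insert or delete needs 6*(8-3)+3 = 33 blocks, so the wrappers would reserve 66 credits per handle.
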
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/include/linux/quota.h linux-2.6.4-2-jquota/include/linux/quota.h
--- linux-2.6.4-1-lockfix/include/linux/quota.h	2004-03-04 09:26:00.000000000 +0100
+++ linux-2.6.4-2-jquota/include/linux/quota.h	2004-03-22 21:29:49.000000000 +0100
@@ -138,6 +138,10 @@
 #include <linux/dqblk_v1.h>
 #include <linux/dqblk_v2.h>
 
+/* Maximum number of writes for a quota operation (insert/delete/update)
+ * (over all formats) - info block, 4 pointer blocks, data block */
+#define DQUOT_MAX_WRITES	6
+
 /*
  * Data for one user/group kept in memory
  */
@@ -168,22 +172,21 @@
 	} u;
 };
 
+struct super_block;
+
 #define DQF_MASK 0xffff		/* Mask for format specific flags */
 #define DQF_INFO_DIRTY_B 16
 #define DQF_ANY_DQUOT_DIRTY_B 17
 #define DQF_INFO_DIRTY (1 << DQF_INFO_DIRTY_B)	/* Is info dirty? */
 #define DQF_ANY_DQUOT_DIRTY (1 << DQF_ANY_DQUOT_DIRTY_B) /* Is any dquot dirty? */
 
-extern inline void mark_info_dirty(struct mem_dqinfo *info)
-{
-	set_bit(DQF_INFO_DIRTY_B, &info->dqi_flags);
-}
-
+extern void mark_info_dirty(struct super_block *sb, int type);
 #define info_dirty(info) test_bit(DQF_INFO_DIRTY_B, &(info)->dqi_flags)
 #define info_any_dquot_dirty(info) test_bit(DQF_ANY_DQUOT_DIRTY_B, &(info)->dqi_flags)
 #define info_any_dirty(info) (info_dirty(info) || info_any_dquot_dirty(info))
 
 #define sb_dqopt(sb) (&(sb)->s_dquot)
+#define sb_dqinfo(sb, type) (sb_dqopt(sb)->info+(type))
 
 struct dqstats {
 	int lookups;
@@ -200,15 +203,13 @@
 
 #define NR_DQHASH 43            /* Just an arbitrary number */
 
-#define DQ_MOD_B	0
-#define DQ_BLKS_B	1
-#define DQ_INODES_B	2
-#define DQ_FAKE_B	3
-
-#define DQ_MOD        (1 << DQ_MOD_B)	/* dquot modified since read */
-#define DQ_BLKS       (1 << DQ_BLKS_B)	/* uid/gid has been warned about blk limit */
-#define DQ_INODES     (1 << DQ_INODES_B)	/* uid/gid has been warned about inode limit */
-#define DQ_FAKE       (1 << DQ_FAKE_B)	/* no limits only usage */
+#define DQ_MOD_B	0	/* dquot modified since read */
+#define DQ_BLKS_B	1	/* uid/gid has been warned about blk limit */
+#define DQ_INODES_B	2	/* uid/gid has been warned about inode limit */
+#define DQ_FAKE_B	3	/* no limits only usage */
+#define DQ_READ_B	4	/* dquot was read into memory */
+#define DQ_ACTIVE_B	5	/* dquot is active (dquot_release not called) */
+#define DQ_WAITFREE_B	6	/* dquot being waited (by invalidate_dquots) */
 
 struct dquot {
 	struct list_head dq_hash;	/* Hash list in memory */
@@ -216,8 +217,7 @@
 	struct list_head dq_free;	/* Free list element */
 	struct semaphore dq_lock;	/* dquot IO lock */
 	atomic_t dq_count;		/* Use count */
-
-	/* fields after this point are cleared when invalidating */
+	wait_queue_head_t dq_wait_unused;	/* Wait queue for dquot to become unused */
 	struct super_block *dq_sb;	/* superblock this applies to */
 	unsigned int dq_id;		/* ID this applies to (uid, gid) */
 	loff_t dq_off;			/* Offset of dquot on disk */
@@ -238,19 +238,22 @@
 	int (*write_file_info)(struct super_block *sb, int type);	/* Write main info about file */
 	int (*free_file_info)(struct super_block *sb, int type);	/* Called on quotaoff() */
 	int (*read_dqblk)(struct dquot *dquot);		/* Read structure for one user */
-	int (*commit_dqblk)(struct dquot *dquot);	/* Write (or delete) structure for one user */
+	int (*commit_dqblk)(struct dquot *dquot);	/* Write structure for one user */
+	int (*release_dqblk)(struct dquot *dquot);	/* Called when last reference to dquot is being dropped */
 };
 
 /* Operations working with dquots */
 struct dquot_operations {
-	void (*initialize) (struct inode *, int);
-	void (*drop) (struct inode *);
+	int (*initialize) (struct inode *, int);
+	int (*drop) (struct inode *);
 	int (*alloc_space) (struct inode *, qsize_t, int);
 	int (*alloc_inode) (const struct inode *, unsigned long);
-	void (*free_space) (struct inode *, qsize_t);
-	void (*free_inode) (const struct inode *, unsigned long);
+	int (*free_space) (struct inode *, qsize_t);
+	int (*free_inode) (const struct inode *, unsigned long);
 	int (*transfer) (struct inode *, struct iattr *);
-	int (*write_dquot) (struct dquot *);
+	int (*write_dquot) (struct dquot *);		/* Ordinary dquot write */
+	int (*mark_dirty) (struct dquot *);		/* Dquot is marked dirty */
+	int (*write_info) (struct super_block *, int);	/* Write of quota "superblock" */
 };
 
 /* Operations handling requests from userspace */
@@ -289,10 +292,7 @@
 };
 
 /* Inline would be better but we need to dereference super_block which is not defined yet */
-#define mark_dquot_dirty(dquot) do {\
-	set_bit(DQF_ANY_DQUOT_DIRTY_B, &(sb_dqopt((dquot)->dq_sb)->info[(dquot)->dq_type].dqi_flags));\
-	set_bit(DQ_MOD_B, &(dquot)->dq_flags);\
-} while (0)
+int mark_dquot_dirty(struct dquot *dquot);
 
 #define dquot_dirty(dquot) test_bit(DQ_MOD_B, &(dquot)->dq_flags)
 
@@ -304,7 +304,6 @@
 
 int register_quota_format(struct quota_format_type *fmt);
 void unregister_quota_format(struct quota_format_type *fmt);
-void init_dquot_operations(struct dquot_operations *fsdqops);
 
 struct quota_module_name {
 	int qm_fmt_id;
diff -ruX /home/jack/.kerndiffexclude linux-2.6.4-1-lockfix/include/linux/quotaops.h linux-2.6.4-2-jquota/include/linux/quotaops.h
--- linux-2.6.4-1-lockfix/include/linux/quotaops.h	2004-03-17 10:37:23.000000000 +0100
+++ linux-2.6.4-2-jquota/include/linux/quotaops.h	2004-03-22 21:29:49.000000000 +0100
@@ -22,16 +22,31 @@
  */
 extern void sync_dquots(struct super_block *sb, int type);
 
-extern void dquot_initialize(struct inode *inode, int type);
-extern void dquot_drop(struct inode *inode);
+extern int dquot_initialize(struct inode *inode, int type);
+extern int dquot_drop(struct inode *inode);
 
-extern int  dquot_alloc_space(struct inode *inode, qsize_t number, int prealloc);
-extern int  dquot_alloc_inode(const struct inode *inode, unsigned long number);
+extern int dquot_alloc_space(struct inode *inode, qsize_t number, int prealloc);
+extern int dquot_alloc_inode(const struct inode *inode, unsigned long number);
 
-extern void dquot_free_space(struct inode *inode, qsize_t number);
-extern void dquot_free_inode(const struct inode *inode, unsigned long number);
+extern int dquot_free_space(struct inode *inode, qsize_t number);
+extern int dquot_free_inode(const struct inode *inode, unsigned long number);
 
-extern int  dquot_transfer(struct inode *inode, struct iattr *iattr);
+extern int dquot_transfer(struct inode *inode, struct iattr *iattr);
+extern int dquot_commit(struct dquot *dquot);
+extern int dquot_acquire(struct dquot *dquot);
+extern int dquot_release(struct dquot *dquot);
+extern int dquot_commit_info(struct super_block *sb, int type);
+extern int dquot_mark_dquot_dirty(struct dquot *dquot);
+
+extern int vfs_quota_on(struct super_block *sb, int type, int format_id, char *path);
+extern int vfs_quota_on_mount(int type, int format_id, struct dentry *dentry);
+extern int vfs_quota_off(struct super_block *sb, int type);
+#define vfs_quota_off_mount(sb, type) vfs_quota_off(sb, type)
+extern int vfs_quota_sync(struct super_block *sb, int type);
+extern int vfs_get_dqinfo(struct super_block *sb, int type, struct if_dqinfo *ii);
+extern int vfs_set_dqinfo(struct super_block *sb, int type, struct if_dqinfo *ii);
+extern int vfs_get_dqblk(struct super_block *sb, int type, qid_t id, struct if_dqblk *di);
+extern int vfs_set_dqblk(struct super_block *sb, int type, qid_t id, struct if_dqblk *di);
 
 /*
  * Operations supported for diskquotas.
@@ -42,6 +57,8 @@
 #define sb_dquot_ops (&dquot_operations)
 #define sb_quotactl_ops (&vfs_quotactl_ops)
 
+/* It is better to call this function outside of any transaction as it might
+ * need a lot of space in journal for dquot structure allocation. */
 static __inline__ void DQUOT_INIT(struct inode *inode)
 {
 	BUG_ON(!inode->i_sb);
@@ -49,6 +66,7 @@
 		inode->i_sb->dq_op->initialize(inode, -1);
 }
 
+/* The same as with DQUOT_INIT */
 static __inline__ void DQUOT_DROP(struct inode *inode)
 {
 	if (IS_QUOTAINIT(inode)) {
@@ -57,6 +75,8 @@
 	}
 }
 
+/* The following allocation/freeing/transfer functions *must* be called inside
+ * a transaction (deadlocks possible otherwise) */
 static __inline__ int DQUOT_PREALLOC_SPACE_NODIRTY(struct inode *inode, qsize_t nr)
 {
 	if (sb_any_quota_enabled(inode->i_sb)) {
@@ -137,6 +157,7 @@
 	return 0;
 }
 
+/* The following two functions cannot be called inside a transaction */
 #define DQUOT_SYNC(sb)	sync_dquots(sb, -1)
 
 static __inline__ int DQUOT_OFF(struct super_block *sb)

From crosser at rol.ru  Mon Apr 19 18:27:05 2004
From: crosser at rol.ru (Eugene Crosser)
Date: Mon, 19 Apr 2004 22:27:05 +0400
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1082385443.17175.183.camel@ariel.sovam.com>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
	<20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
	<1082385443.17175.183.camel@ariel.sovam.com>
Message-ID: <1082399225.1582.11.camel@ariel.sovam.com>

On Mon, 2004-04-19 at 18:37, Eugene Crosser wrote:

> > > > An obvious cure is to shift the start of the list to point just after
> > > > the item just synced.  I've done only limited testing of this patch, but
> > > > does it help for you?
> > > 
> > > Cool!  I've already begun to build a testing environment with oprofile
> > > enabled ;-)  During the weekend, I am out of the office, but I'll
> > > certainly verify your fix on Monday.
> >   Do you already have results? I'd be interested in them...
> 
> From the first impression, it did not help.

Luckily, I was wrong.  At least in the test environment, I get the
results below.  This is a 2.6.5 kernel with the equivalent of Stephen's patch.
'setquota' sets user and group quota for 40,000 userids and 40,000
groupids.  'mktree' writes one byte into 40,000 files and sets their
owner and group to 40,000 unique values.

			unpatched		patched
setquota		4m2.9			3m42
sync			0m0.1			0m0.1
mktree(1)		7m3.8			7m50
sync			7m50			0m0.8
mktree(2)		7m23			6m40
sync			7m49			0m1

On my 'big' system sync still runs unexpectedly long (40sec - 4min) but
it is by far better than it was before the patch...

Thanks everybody involved!

Eugene

From sct at redhat.com  Mon Apr 19 20:43:32 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 19 Apr 2004 21:43:32 +0100
Subject: online resize of ext3 possible?
In-Reply-To: <16512.1572.690002.225379@gargle.gargle.HOWL>
References: <16512.1572.690002.225379@gargle.gargle.HOWL>
Message-ID: <1082407412.2237.93.camel@sisko.scot.redhat.com>

Hi,

On Fri, 2004-04-16 at 17:13, John Stoffel wrote:

> Is it possible to resize an ext3 filesystem while it's online?  

Yes and no.  Yes: you want the bits at
http://sourceforge.net/projects/ext2resize/ for it.  But no, it's still
somewhat experimental --- the kernel bits seem robust for me but the
user-land tools still need a bit of work.  For online growth on ext2/3,
you need to prepare space on-disk for the filesystem's group descriptor
tables to grow into.  The new format for that is only understood by a
patched mke2fs, so you can't "prepare" an existing filesystem for online
growth just yet.  And e2fsck can't yet properly repair errors in that
reserved space.

I've *just* done a release of the kernel patches against 2.6.6-rc1 and
the user-space patches against e2fsprogs-1.35, and you're welcome to try
those out.

> It
> looks like resize2fs won't do the trick unless the filesystem is
> unmounted.  

Correct, resize2fs is offline-only.

> And ext2resize takes one look at the filesystem while it's
> mounted and complains as well, this time about un-supported features.

The current one in CVS works for me --- I've added recognition of
extended attributes to that.  A proper release of that is in progress.

Cheers,
 Stephen





From rjwalsh at durables.org  Mon Apr 19 20:47:53 2004
From: rjwalsh at durables.org (Robert Walsh)
Date: Mon, 19 Apr 2004 13:47:53 -0700
Subject: online resize of ext3 possible?
In-Reply-To: <1082407412.2237.93.camel@sisko.scot.redhat.com>
References: <16512.1572.690002.225379@gargle.gargle.HOWL>
	<1082407412.2237.93.camel@sisko.scot.redhat.com>
Message-ID: <1082407673.6514.4.camel@hematite.internal.keyresearch.com>

> > Is it possible to resize an ext3 filesystem while it's online?  
> 
> Yes and no

I remember someone mentioning some new and improved mechanism for doing
online-resize some time back, but I haven't heard much about it since. 
It involved some changes to the ext3 format.  What was all that about? 
Has anything happened since?

Regards,
 Robert.




From adilger at clusterfs.com  Mon Apr 19 21:02:24 2004
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 19 Apr 2004 15:02:24 -0600
Subject: online resize of ext3 possible?
In-Reply-To: <1082407673.6514.4.camel@hematite.internal.keyresearch.com>
References: <16512.1572.690002.225379@gargle.gargle.HOWL>
	<1082407412.2237.93.camel@sisko.scot.redhat.com>
	<1082407673.6514.4.camel@hematite.internal.keyresearch.com>
Message-ID: <20040419210224.GM12357@schnapps.adilger.int>

On Apr 19, 2004  13:47 -0700, Robert Walsh wrote:
> I remember someone mentioning some new and improved mechanism for doing
> online-resize some time back, but I haven't heard much about it since. 
> It involved some changes to the ext3 format.  What was all that about? 
> Has anything happened since?

This is the "meta block group" change, and compatibility support for that
just went into 2.4.25.  It has been in 2.6 and e2fsprogs for a while already.
Nothing actually uses it yet, but in theory the online resizing could
start using it.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/




From tytso at mit.edu  Mon Apr 19 22:28:15 2004
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 19 Apr 2004 18:28:15 -0400
Subject: Strange Fedora Booting problem: can not mount
	"LABEL=*"partitions
In-Reply-To: <41089CB27BD8D24E8385C8003EDAF7AB084856@karl.alexa.com>
References: <41089CB27BD8D24E8385C8003EDAF7AB084856@karl.alexa.com>
Message-ID: <20040419222815.GA4897@thunk.org>

On Thu, Apr 15, 2004 at 10:50:03AM -0700, Guolin Cheng wrote:
> 
>  Thanks. But the problem got debugged&fixed, the answer was posted on
> fedora-list about 2 weeks ago. 
> 
> The problem is: the /etc/blkid.tab file works as an old inappropriate
> disk partitions cache for fsck|blkid commands when system image is
> installed to a different arch (scsi->ide) machine, the old cache will
> mislead fsck|blkid at the first run and only the first run, since the
> first run will update /etc/blkid.tab file. 

What version of e2fsprogs were you testing with?  I've just tested
using the latest version of e2fsprogs, and it works just fine.  In the
test below, I corrupt /etc/blkid.tab by swapping /dev/hda and
/dev/hdc.  This might correspond with what might happen after disks get
switched around.  I run fsck in debugging mode to make sure it gets
the correct devices, despite the corrupted /etc/blkid.tab file.  As
you can see, it works fine:

# sed -e 's/hda/hdz/g' -e 's/hdc/hda/g' -e 's/hdz/hdc/' -e 's/0x03/0xf3/' -e s'/0x16/0x03/' -e 's/0xf3/0x16/' /etc/blkid.tab  > /tmp/blkid.tab.broken

# cp /tmp/blkid.tab.broken /etc/blkid.tab

# fsck -AVN -a
fsck 1.35 (28-Feb-2004)
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/hda1
[/sbin/fsck.ext3 (1) -- /usr] fsck.ext3 -a /dev/hda3
[/sbin/fsck.ext3 (1) -- /debian] fsck.ext3 -a /dev/hda5

# e2label /dev/hda1
root

# e2label /dev/hda3
usr

# e2label /dev/hda5
Debian

The only thing I can think of is that you might have been using an
older version of e2fsprogs that was buggy....

						- Ted
-------------- next part --------------
# /etc/fstab: static file system information.
#
# <file system>	<mount point>	<type>	<options>		<dump>	<pass>
LABEL=root	/		ext3	noatime,errors=remount-ro 0	1
/dev/hda2	none		swap	sw			0	0
LABEL=usr	/usr		ext3	noatime,errors=remount-ro 0	2
LABEL=Debian	/debian		ext3	ro,noatime,user		0	2
#/dev/hdc1	/dos		vfat	uid=15806		0	0
proc		/proc		proc	defaults		0	0
binfmt_misc	/proc/sys/fs/binfmt_misc binfmt_misc defaults	0	0
tmp		/tmp		tmpfs	defaults		0	1
/dev/fd0	/floppy		auto	user,noauto		0	0
/dev/cdrom	/cdrom		iso9660	unhide,ro,user,noauto		0	0
fs:/            /fs     nfs     noauto,user,exec,suid,intr      0 0
fs:/vicepa      /fs/vicepa nfs  noauto,user,exec,suid,intr      0 0
fs:/u2          /fs/u2  nfs     noauto,user,exec,suid,intr      0 0
fs:/u3          /fs/u3  nfs     noauto,user,exec,suid,intr      0 0
fs:/u4          /fs/u4  nfs     noauto,user,exec,suid,intr      0 0
fs:/u5          /fs/u5  nfs     noauto,user,exec,suid,intr      0 0
fs:/u6          /fs/u6  nfs     noauto,user,exec,suid,intr      0 0
fs:/usr         /fs/usr nfs     noauto,user,exec,suid,intr      0 0
-------------- next part --------------
<device DEVNO="0x1601" TIME="1182412900" UUID="14484e10-81bb-417a-9ae5-6115bbcdf792" TYPE="ext2">/dev/hdc1</device>
<device DEVNO="0x1602" TIME="1182412900" TYPE="swap">/dev/hdc2</device>
<device DEVNO="0x1603" TIME="1182412900" UUID="afc6b073-ad8b-4440-931d-5558e3618fa9" SEC_TYPE="ext3" TYPE="ext2">/dev/hdc3</device>
<device DEVNO="0x0301" TIME="1182412900" UUID="14484e10-81bb-417a-9ae5-6115bbcdf792" TYPE="ext2" LABEL="root" SEC_TYPE="ext3">/dev/hda1</device>
<device DEVNO="0x0302" TIME="1182412900" TYPE="swap">/dev/hda2</device>
<device DEVNO="0x0303" TIME="1182412900" UUID="afc6b073-ad8b-4440-931d-5558e3618fa9" SEC_TYPE="ext3" TYPE="ext2" LABEL="usr">/dev/hda3</device>
<device DEVNO="0x0305" TIME="1182412900" LABEL="Debian" UUID="eaf43bde-8da2-4844-aed7-80729e93bd13" SEC_TYPE="ext3" TYPE="ext2">/dev/hda5</device>
<device DEVNO="0x0306" TIME="1182412901" UUID="8ee7a0eb-b8e3-4d33-b831-48f1a0d51dc3" SEC_TYPE="ext3" TYPE="ext2">/dev/hda6</device>
<device DEVNO="0x0307" TIME="1182412901" UUID="4036-F32E" TYPE="msdos">/dev/hda7</device>
<device DEVNO="0x1604" TIME="1182412901" UUID="516dee8c-6313-4f25-809d-12f0d16b6614" TYPE="ext2">/dev/hdc4</device>

From guolin at alexa.com  Tue Apr 20 00:03:25 2004
From: guolin at alexa.com (Guolin Cheng)
Date: Mon, 19 Apr 2004 17:03:25 -0700
Subject: Strange Fedora Booting problem: can not mount
	"LABEL=*"partitions
Message-ID: <41089CB27BD8D24E8385C8003EDAF7ABBA48C9@karl.alexa.com>

Hi, Theodore,

 Thanks for your tests. But I got the problem and got it FIXED by
flushing out the contents of /etc/blkid.tab before cloned clients
reboot.

 I'm using the e2fsprogs-1.34-1 that comes with Fedora Core 1.

hello06.alexa.com root 135% rpm -qf /sbin/fsck
e2fsprogs-1.34-1
hello06.alexa.com root 136%

 The original contents of /etc/blkid.tab are attached below, while
my cloned machine is, in fact, a PATA IDE machine. All the /dev/sd*
entries should be /dev/hd* instead. Thanks.

hello06.alexa.com root 168% cat blkid.tab
<device DEVNO="0x0806" TIME="1079722753" TYPE="swap">/dev/sda6</device>
<device DEVNO="0x0807" TIME="1079722753" TYPE="swap">/dev/sda7</device>
<device DEVNO="0x0808" TIME="1079722753" TYPE="swap">/dev/sda8</device>
<device DEVNO="0x080a" TIME="1079722753" LABEL="/var" UUID="a4e5efd0-a648-472d-a70d-737461f2acf6" SEC_TYPE="ext3" TYPE="ext2">/dev/sda10</device>
<device DEVNO="0x0811" TIME="1079722753" LABEL="/1" UUID="9b90d679-275e-4656-916f-21ce963677e7" SEC_TYPE="ext3" TYPE="ext2">/dev/sdb1</device>
<device DEVNO="0x2201" TIME="1079658729" LABEL="/1" UUID="9b90d679-275e-4656-916f-21ce963677e7" SEC_TYPE="ext3" TYPE="ext2">/dev/hdg1</device>
<device DEVNO="0x2106" TIME="1079658729" TYPE="swap">/dev/hde6</device>
<device DEVNO="0x2107" TIME="1079658729" TYPE="swap">/dev/hde7</device>
<device DEVNO="0x2108" TIME="1079658729" TYPE="swap">/dev/hde8</device>
<device DEVNO="0x210a" TIME="1079658729" LABEL="/var" UUID="a4e5efd0-a648-472d-a70d-737461f2acf6" SEC_TYPE="ext3" TYPE="ext2">/dev/hde10</device>
<device DEVNO="0x0801" TIME="1079722753" LABEL="/" UUID="a3866268-8cf6-4b88-b3d9-e9623a1763a4" SEC_TYPE="ext3" TYPE="ext2">/dev/sda1</device>
<device DEVNO="0x0805" TIME="1079722753" LABEL="/usr" UUID="6ae04a60-5e15-44d7-a009-dca14f2ad01b" SEC_TYPE="ext3" TYPE="ext2">/dev/sda5</device>
<device DEVNO="0x0809" TIME="1079722753" LABEL="/alexa" UUID="06009234-95d5-4088-87c0-5bde8470a8ba" SEC_TYPE="ext3" TYPE="ext2">/dev/sda9</device>
<device DEVNO="0x080b" TIME="1079722753" LABEL="/0" UUID="655e3381-7157-463e-aaed-fe45a9617e79" SEC_TYPE="ext3" TYPE="ext2">/dev/sda11</device>
<device DEVNO="0x0821" TIME="1079722753" LABEL="/2" UUID="b0971594-1529-406a-b969-278d4f770cb3" SEC_TYPE="ext3" TYPE="ext2">/dev/sdc1</device>
<device DEVNO="0x0831" TIME="1079722753" LABEL="/3" UUID="46eaafeb-0fb7-4647-91e1-c9fe9a1374a7" SEC_TYPE="ext3" TYPE="ext2">/dev/sdd1</device>
hello06.alexa.com root 169%

Thanks.

--Guolin Cheng


From sct at redhat.com  Tue Apr 20 10:06:28 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 20 Apr 2004 11:06:28 +0100
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1082399225.1582.11.camel@ariel.sovam.com>
References: <1080125239.4717.33.camel@ariel.sovam.com>
	<1080737188.1991.9.camel@sisko.scot.redhat.com>
	<1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
	<20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
	<1082385443.17175.183.camel@ariel.sovam.com>
	<1082399225.1582.11.camel@ariel.sovam.com>
Message-ID: <1082455588.2106.12.camel@sisko.scot.redhat.com>

Hi,

On Mon, 2004-04-19 at 19:27, Eugene Crosser wrote:

> > From the first impression, it did not help.
> 
> Luckily, I was wrong.  At least in the test environment, I get the
> results below.  This is a 2.6.5 kernel with the equivalent of Stephen's patch.
> 'setquota' sets user and group quota for 40,000 userids and 40,000
> groupids.  'mktree' writes one byte into 40,000 files and sets their
> owner and group to 40,000 unique values.
> 
> 			unpatched		patched
> sync			7m50			0m0.8

OK, that's a win. :-)

> On my 'big' system sync still runs unexpectedly long (40sec - 4min) but
> it is by far better than it was before the patch...

That's quite possibly just the raw cost of writing out a million
dquots.  The overhead of the list traversal should be under control now,
though.  A profile would help determine what the remaining cost is: is
that time spent in the CPU, or in IO wait?

Jan, mind if I push the patch to 2.4?  Your locking concern seems to be
2.6-only; on 2.4, BKL should be sufficient protection even on SMP.

Cheers,
 Stephen




From deepan_acharya at yahoo.com  Mon Apr 19 21:40:19 2004
From: deepan_acharya at yahoo.com (Deepan Acharya)
Date: Mon, 19 Apr 2004 14:40:19 -0700 (PDT)
Subject: "ext3-fs warning : ext3_block_to_path block <0 "
Message-ID: <20040419214019.71554.qmail@web41304.mail.yahoo.com>

Problem Definition:
I have the following server system configuration:
Red Hat Linux release 7.3 (Valhalla)
Kernel 2.4.18-27.7.xsmp on an i686
There are two local disk drives on the system, configured as RAID 0
using the onboard hardware RAID controller. All the basic Linux
volumes (partitions), i.e.

/, / root, /boot, /home, /usr, /var, /tmp, /dev/shm

are mounted off the above local RAID 0 configuration as /dev/sdb1
through /dev/sdb8.
In addition, I have another volume dedicated to an application, which
is mounted off a SAN storage device via an Emulex Fibre Channel link
as /dev/sda1.

The problem is that whenever the system mounts this particular volume
from the SAN, it generates the following error message on the console:

"ext3-fs warning : ext3_block_to_path block <0 "

This message continues to be generated and occasionally causes the
system to crash.
When that happens, the system cannot be logged into, even from the
console; only a power cycle of the system works.  Interestingly, I have
another system with the same hardware and software configuration and
setup which works perfectly fine.
Though I am not sure, it seems to be an issue with a specific file or
block on the mounted device. Any suggestions or advice on how to go
about fixing the above problem as soon as possible would be more than
helpful and really appreciated.
Sincerely
Deepan.


		

From deepan_acharya at yahoo.com  Mon Apr 19 22:54:27 2004
From: deepan_acharya at yahoo.com (Deepan Acharya)
Date: Mon, 19 Apr 2004 15:54:27 -0700 (PDT)
Subject: Error "ext3-fs warning : ext3_block_to_path block <0 "
Message-ID: <20040419225427.94892.qmail@web41307.mail.yahoo.com>

Problem Definition:
I have the following server system configuration:
Red Hat Linux release 7.3 (Valhalla)
Kernel 2.4.18-27.7.xsmp on an i686
There are two local SCSI disk drives on the system, configured as RAID 1
using the onboard hardware RAID controller.

All the basic Linux volumes (partitions), i.e.

/, / root, /boot, /home, /usr, /var, /tmp, /dev/shm

are mounted off the above local RAID 1 configuration as /dev/sdb1
through /dev/sdb8.

In addition, I have another volume dedicated to an application, which
is mounted off a SAN storage device via an Emulex Fibre Channel link
as /dev/sda1.

The problem is that whenever the system mounts this particular volume
from the SAN, it generates the following error message on the console:

"ext3-fs warning : ext3_block_to_path block <0 "

This message continues to be generated, and about once every 2 months
it causes the system to crash.

When that happens, the system cannot be logged into, even from the
console; only a power cycle of the system works.  Interestingly, I have
another system with the same hardware and software configuration and
setup which works perfectly fine.

Though I am not sure, it seems to be an issue with a specific file or
block on the mounted device. Any suggestions or advice on how to go
about fixing the above problem as soon as possible would be more than
helpful and really appreciated.

Sincerely
Deepan.

		

From tytso at mit.edu  Tue Apr 20 15:33:28 2004
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 20 Apr 2004 11:33:28 -0400
Subject: Strange Fedora Booting problem: can not mount
	"LABEL=*"partitions
In-Reply-To: <41089CB27BD8D24E8385C8003EDAF7ABBA48C9@karl.alexa.com>
References: <41089CB27BD8D24E8385C8003EDAF7ABBA48C9@karl.alexa.com>
Message-ID: <20040420153328.GB3441@thunk.org>

On Mon, Apr 19, 2004 at 05:03:25PM -0700, Guolin Cheng wrote:
> Hi, Theodore,
> 
>  Thanks for your tests. But I got the problem and got it FIXED by
> flushing out the contents of /etc/blkid.tab before cloned clients
> reboot.
> 
>  I'm using the e2fsprogs-1.34-1 comes with Fedora Core 1.

Can you do me a favor and please try replicating my test using the
e2fsprogs-1.34-1 with Fedora Core 1, instead of just asserting that
the problem exists?  What you are describing should not be occurring
at all; you should not need to do the workaround.  

In fact, if you need to do it, it's a bad, Bad, BAD problem in the
blkid library.  This is why I'm so interested in tracking this down.

I have not been able to replicate the problem using the most recent
version of e2fsprogs, and I would like to know whether the problem is still
there.  If you could try replicating the test which I did in my last
e-mail message, and the problem exists in Fedora Core 1, but does not
exist in recent versions of e2fsprogs, then I can be confident in
knowing that the problem has been fixed.  (Or was introduced in the
Fedora Core 1 changes, in which case it's not my problem.  :-)

						- Ted




From jack at ucw.cz  Tue Apr 20 17:57:32 2004
From: jack at ucw.cz (Jan Kara)
Date: Tue, 20 Apr 2004 19:57:32 +0200
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1082455588.2106.12.camel@sisko.scot.redhat.com>
References: <1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
	<20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
	<1082385443.17175.183.camel@ariel.sovam.com>
	<1082399225.1582.11.camel@ariel.sovam.com>
	<1082455588.2106.12.camel@sisko.scot.redhat.com>
Message-ID: <20040420175732.GA16106@atrey.karlin.mff.cuni.cz>

  Hi,

> On Mon, 2004-04-19 at 19:27, Eugene Crosser wrote:
> 
> > > From the first impression, it did not help.
> > 
> > Luckily, I was wrong.  At least in the test environment, I get the
> > results below.  This is a 2.6.5 kernel with the equivalent of Stephen's patch.
> > 'setquota' sets user and group quota for 40,000 userids and 40,000
> > groupids.  'mktree' writes one byte into 40,000 files and sets their
> > owner and group to 40,000 unique values.
> > 
> > 			unpatched		patched
> > sync			7m50			0m0.8
> 
> OK, that's a win. :-)
> 
> > On my 'big' system sync still runs unexpectedly long (40sec - 4min) but
> > it is by far better than it was before the patch...
> 
> That's quite possibly just the raw cost of writing out a million
> dquots.  The overhead of the list traversal should be under control now,
> though.  A profile would help determine what the remaining cost is: is
> that time spent in the CPU, or in IO wait?
> 
> Jan, mind if I push the patch to 2.4?  Your locking concern seems to be
> 2.6-only; on 2.4, BKL should be sufficient protection even on SMP.
  For 2.4 I think your patch is the best solution. Please push it. For
2.6 I've written a patch which implements per-sb lists of dirty dquots
which should also fix the problem and I think it's nicer. My patch needs
a bit more testing so I'll submit it later this week. Thanks for help.

								Honza




From d_baron at 012.net.il  Tue Apr 20 19:09:23 2004
From: d_baron at 012.net.il (David Baron)
Date: Tue, 20 Apr 2004 22:09:23 +0300
Subject: Periodic fsck's fail but all is OK.
Message-ID: <200404202209.23596.d_baron@012.net.il>

Every 24 mounts, fsck runs. Often enough, this run will fail, demanding a
manual run. This manual -f run goes through its stages and only on rare
occasions finds an orphaned inode. 99% of the time, it finds nothing amiss.

Any reason for these "failures"?




From adilger at clusterfs.com  Tue Apr 20 16:15:40 2004
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 20 Apr 2004 10:15:40 -0600
Subject: "ext3-fs warning : ext3_block_to_path block <0 "
In-Reply-To: <20040419214019.71554.qmail@web41304.mail.yahoo.com>
References: <20040419214019.71554.qmail@web41304.mail.yahoo.com>
Message-ID: <20040420161540.GS12357@schnapps.adilger.int>

On Apr 19, 2004  14:40 -0700, Deepan Acharya wrote:
> The problem is whenever the system mounts this particular volume of the  SAN , the system generates the following error messages on the console.
>  
> "ext3-fs warning : ext3_block_to_path block <0 "

Run "e2fsck -f" on this device.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/




From sct at redhat.com  Tue Apr 20 22:27:57 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 20 Apr 2004 23:27:57 +0100
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <20040420175732.GA16106@atrey.karlin.mff.cuni.cz>
References: <1080738345.22942.53.camel@ariel.sovam.com>
	<1080740974.1991.28.camel@sisko.scot.redhat.com>
	<1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
	<20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
	<1082385443.17175.183.camel@ariel.sovam.com>
	<1082399225.1582.11.camel@ariel.sovam.com>
	<1082455588.2106.12.camel@sisko.scot.redhat.com>
	<20040420175732.GA16106@atrey.karlin.mff.cuni.cz>
Message-ID: <1082500077.2106.17.camel@sisko.scot.redhat.com>

Hi,

On Tue, 2004-04-20 at 18:57, Jan Kara wrote:

>   For 2.4 I think your patch is the best solution. Please push it. For
> 2.6 I've written a patch which implements per-sb lists of dirty dquots

That's only part of the problem: a million dquots on a single sb will
still show this performance degradation currently.  Rotating the list
head after each progress will make sure that even if we are processing
unnecessary dquots, we at least do so only once per sync (which is way
better than 0.5*N^2 times!)

But I'll push the bits to Marcelo, thanks.

Cheers,
 Stephen




From vijayan at cs.wisc.edu  Wed Apr 21 04:56:06 2004
From: vijayan at cs.wisc.edu (Vijayan Prabhakaran)
Date: Tue, 20 Apr 2004 23:56:06 -0500 (CDT)
Subject: Separate common journal device
Message-ID: <Pine.LNX.4.58.0404202353570.28409@frylock.cs.wisc.edu>


Hi,

Is it possible to use a separate journal device (one on a separate
drive or a partition) shared among more than 1 Ext3 file systems ?

I appreciate any inputs.

thanks,
Vijayan




From adilger at clusterfs.com  Wed Apr 21 09:23:01 2004
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 21 Apr 2004 03:23:01 -0600
Subject: Separate common journal device
In-Reply-To: <Pine.LNX.4.58.0404202353570.28409@frylock.cs.wisc.edu>
References: <Pine.LNX.4.58.0404202353570.28409@frylock.cs.wisc.edu>
Message-ID: <20040421092301.GD2938@schnapps.adilger.int>

On Apr 20, 2004  23:56 -0500, Vijayan Prabhakaran wrote:
> Is it possible to use a separate journal device (one on a separate
> drive or a partition) shared among more than 1 Ext3 file systems ?

It is possible now to use an external block device for a single filesystem.
The on-disk format is designed to allow multiple filesystems to share the
same device, but that has never been fully implemented.

At one point I had implemented a patch to mount a "jbd" filesystem with the
journal device as the first step of having a shared journal device.  Having
the "jbd" device in /etc/fstab (before filesystems that use it) allows e2fsck
to do journal replay on all of the filesystems before the journal starts to
be used, or alternately dumps the journal data to an external file for later
replay (e.g. if block devices are not available when e2fsck is run on the
jbd device).  It also allows the jbd code to configure the in-core code to
be ready for external filesystems to connect to it.  Finally, it also marks
the block device as in-use so it is less likely that it will be overwritten
accidentally.

See the following email for the (ancient) patch.  Most of the comments
and a large fraction of the code in that email are still relevant, with
the exception that all of the UUID handling already exists as libblkid
in e2fsprogs, and it doesn't say what kernel version this is for (I'd
suspect 2.3, but I'm not totally sure).  Sadly, nobody commented on it
at the time and it was lost in the mists of antiquity.

> Subject: [PATCH][RFC] mountable journal devices
> To: Ext2 development mailing list <ext2-devel at lists.sourceforge.net>
> Date: Wed, 8 Aug 2001 02:08:23 -0600 (MDT)
http://marc.theaimsgroup.com/?l=ext2-devel&m=99725819513803

And the thread starting at the following URL discusses shared external journal devices:
https://listman.redhat.com/archives/ext3-users/2001-November/msg00182.html

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/




From jack at ucw.cz  Wed Apr 21 10:29:21 2004
From: jack at ucw.cz (Jan Kara)
Date: Wed, 21 Apr 2004 12:29:21 +0200
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <1082500077.2106.17.camel@sisko.scot.redhat.com>
References: <1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
	<20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
	<1082385443.17175.183.camel@ariel.sovam.com>
	<1082399225.1582.11.camel@ariel.sovam.com>
	<1082455588.2106.12.camel@sisko.scot.redhat.com>
	<20040420175732.GA16106@atrey.karlin.mff.cuni.cz>
	<1082500077.2106.17.camel@sisko.scot.redhat.com>
Message-ID: <20040421102921.GD12671@atrey.karlin.mff.cuni.cz>

  Hi,

> On Tue, 2004-04-20 at 18:57, Jan Kara wrote:
> 
> >   For 2.4 I think your patch is the best solution. Please push it. For
> > 2.6 I've written a patch which implements per-sb lists of dirty dquots
> 
> That's only part of the problem: a million dquots on a single sb will
> still show this performance degradation currently.  Rotating the list
> head after each progress will make sure that even if we are processing
> unnecessary dquots, we at least do so only once per sync (which is way
> better than 0.5*N^2 times!)
  If I have a dirty dquot list for sb+type (which is what I actually have)
then I just always get the first entry from the list, sync it, delete it and
continue. This way I will have the same asymptotic performance as with
moving on inuse_list (i.e. O(N)), no? And I will get some constant bonus
for a smaller list... Or am I missing something?

								Honza




From sct at redhat.com  Wed Apr 21 20:56:40 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 21 Apr 2004 21:56:40 +0100
Subject: Periodic fsck's fail but all is OK.
In-Reply-To: <200404202209.23596.d_baron@012.net.il>
References: <200404202209.23596.d_baron@012.net.il>
Message-ID: <1082580999.2060.31.camel@sisko.scot.redhat.com>

Hi,

On Tue, 2004-04-20 at 20:09, David Baron wrote:
> Every 24 mounts, fsck runs. Often enough, this run will fail, demanding a
> manual run. The manual -f run goes through its stages and only on rare
> occasions finds an orphaned inode. 99% of the time it finds nothing amiss.

Orphan inodes are normal behaviour if you open a file, unlink it, and
then reboot while it's still open.  There are lots of reasons why that
can happen --- upgrading a library that's still in use by running
processes is the one that I see causing it most often, for example.
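
The open-then-unlink case can be reproduced from user space. What follows
is a minimal sketch, assuming a POSIX system with a writable /tmp; the
function name and the temporary file name are made up for illustration.
It only shows that an unlinked file stays alive while a descriptor is
still open, which is the state that leaves an orphan inode behind if the
machine goes down before the last close.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it while it is still open, and return
 * the link count that fstat() reports for the open descriptor.
 * Returns -1 on any error.  (Hypothetical helper, for illustration only.)
 */
long unlinked_open_link_count(void)
{
    char path[] = "/tmp/orphan-demo-XXXXXX";   /* assumed scratch location */
    struct stat st;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;

    /* Remove the only name.  The inode survives because fd is open. */
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }

    /* The file can still be written through the descriptor. */
    if (write(fd, "x", 1) != 1) {
        close(fd);
        return -1;
    }

    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    /* Dropping the last reference lets the kernel release the inode. */
    close(fd);
    return (long)st.st_nlink;
}
```

On a typical Linux filesystem the returned link count is 0: the file has
no name left but still occupies an inode until the descriptor is closed.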

If fsck finds an orphan, it just means that you've fscked a filesystem
that hasn't yet been mounted by the kernel since the reboot.  Both fsck
and the kernel will clean up orphan inodes as soon as they see them. 
It's nothing to worry about.

Cheers, 
 Stephen





From cchan at outblaze.com  Fri Apr 23 01:52:01 2004
From: cchan at outblaze.com (Christopher Chan)
Date: Fri, 23 Apr 2004 09:52:01 +0800
Subject: 2.6.5 and latest Fedora Core 1 kernels cannot handle files over 2.x
 GB?
Message-ID: <408876C1.80601@outblaze.com>

A mysql database file was copied over to a new box running Fedora Core 1.

The kernel was updated to the latest Fedora release.

However mysqld complains about corrupted tables.

The kernel was then updated to 2.6.5

mysqld still complains about corrupted tables.

Hardware:

Dual PIII 800.
3ware RAID

dmesg:

...
...
...
EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1634890784, count = 1
EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1667330926, count = 1
EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1852795252, count = 1
EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in 
datazone - block = 1752637555, count = 1
ext3_reserve_inode_write: aborting transaction: Journal has aborted in 
__ext3_journal_get_write_access<2>EXT3-fs error (device sdb1) in 
ext3_reserve_inode_write: Journal has aborted
ext3_reserve_inode_write: aborting transaction: Journal has aborted in 
__ext3_journal_get_write_access<2>EXT3-fs error (device sdb1) in 
ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device sdb1) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device sdb1) in ext3_truncate: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs abort (device sdb1): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device sdb1) in start_transaction: Journal has aborted




From sct at redhat.com  Fri Apr 23 11:23:04 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 23 Apr 2004 12:23:04 +0100
Subject: 2.6.5 and latest Fedora Core 1 kernels cannot handle files
	over 2.x GB?
In-Reply-To: <408876C1.80601@outblaze.com>
References: <408876C1.80601@outblaze.com>
Message-ID: <1082719383.2100.9.camel@sisko.scot.redhat.com>

Hi,

Kernels since 2.4 have all been quite happy with files over 2GB.  There
were even patches for large file support on some 2.2 kernels at one
point, though those never got merged upstream.  I doubt that's the
problem, especially since:

On Fri, 2004-04-23 at 02:52, Christopher Chan wrote:
> A mysql database file was copied over to a new box running Fedora Core 1.
> The kernel was updated to the latest Fedora release.
> However mysqld complains about corrupted tables.

> The kernel was then updated to 2.6.5
> mysqld still complains about corrupted tables.

> EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in 
> datazone - block = 1634890784, count = 1

Your filesystem is corrupt.  You need to run e2fsck to fix it up, and
check the files against a backup.  

There's not enough information here to begin to diagnose _why_ they are
corrupt, but on 2.4 systems it's bad hardware 99% of the time. 
"memtest86" is usually a good place to start.

Cheers,
 Stephen





From cchan at outblaze.com  Fri Apr 23 15:59:54 2004
From: cchan at outblaze.com (Christopher Chan)
Date: Fri, 23 Apr 2004 23:59:54 +0800
Subject: 2.6.5 and latest Fedora Core 1 kernels cannot handle files	over
 2.x GB?
In-Reply-To: <1082719383.2100.9.camel@sisko.scot.redhat.com>
References: <408876C1.80601@outblaze.com>
	<1082719383.2100.9.camel@sisko.scot.redhat.com>
Message-ID: <40893D7A.4070706@outblaze.com>

> 
> Your filesystem is corrupt.  You need to run e2fsck to fix it up, and
> check the files against a backup.  
> 
> There's not enough information here to begin to diagnose _why_ they are
> corrupt, but on 2.4 systems it's bad hardware 99% of the time. 
> "memtest86" is usually a good place to start.

Thanks.

I got the same problem under 2.6.5...same goes I guess? We had no 
problems with reiserfs...

Christopher




From sct at redhat.com  Fri Apr 23 16:48:49 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 23 Apr 2004 17:48:49 +0100
Subject: 2.6.5 and latest Fedora Core 1 kernels cannot handle
	files	over 2.x GB?
In-Reply-To: <40893D7A.4070706@outblaze.com>
References: <408876C1.80601@outblaze.com>
	<1082719383.2100.9.camel@sisko.scot.redhat.com>
	<40893D7A.4070706@outblaze.com>
Message-ID: <1082738929.2100.23.camel@sisko.scot.redhat.com>

Hi,

On Fri, 2004-04-23 at 16:59, Christopher Chan wrote:

> > Your filesystem is corrupt.  You need to run e2fsck to fix it up, and
> > check the files against a backup.  
> > 
> > There's not enough information here to begin to diagnose _why_ they are
> > corrupt, but on 2.4 systems it's bad hardware 99% of the time. 
> > "memtest86" is usually a good place to start.
> 
> Thanks.
> 
> I got the same problem under 2.6.5...same goes I guess? We had no 
> problems with reiserfs...

It could be just a one-off event that corrupted it; it could be a memory
fault that developed after you switched; it could be timing-related. 
fsck is definitely the first thing to do; memtest86 is always a useful
next step if there's any suspicion about memory corruption.

--Stephen




From sct at redhat.com  Fri Apr 23 17:06:17 2004
From: sct at redhat.com (Stephen C. Tweedie)
Date: 23 Apr 2004 18:06:17 +0100
Subject: [patch] Re: stalled 'sync' on ext3+quota over drbd
In-Reply-To: <20040421102921.GD12671@atrey.karlin.mff.cuni.cz>
References: <1081177587.7677.110.camel@ariel.sovam.com>
	<1081255826.22308.57.camel@ariel.sovam.com>
	<1082150363.2081.85.camel@sisko.scot.redhat.com>
	<1082198452.20346.25.camel@pccross.average.org>
	<20040419133807.GB15541@atrey.karlin.mff.cuni.cz>
	<1082385443.17175.183.camel@ariel.sovam.com>
	<1082399225.1582.11.camel@ariel.sovam.com>
	<1082455588.2106.12.camel@sisko.scot.redhat.com>
	<20040420175732.GA16106@atrey.karlin.mff.cuni.cz>
	<1082500077.2106.17.camel@sisko.scot.redhat.com>
	<20040421102921.GD12671@atrey.karlin.mff.cuni.cz>
Message-ID: <1082739977.2100.26.camel@sisko.scot.redhat.com>

Hi,

On Wed, 2004-04-21 at 11:29, Jan Kara wrote:

> > That's only part of the problem: a million dquots on a single sb will
> > still show this performance degradation currently.  Rotating the list
> > head after each progress will make sure that even if we are processing
> > unnecessary dquots, we at least do so only once per sync (which is way
> > better than 0.5*N^2 times!)

>   If I have a dirty dquot list for sb+type (which is what I actually have)
> then I just always get the first entry from the list, sync it, delete it and
> continue. This way I will have the same asymptotic performance as with
> moving on inuse_list (i.e. O(N)), no? And I will get some constant bonus
> for a smaller list... Or am I missing something?

Never mind me, I was reading your email as suggesting per-sb quota
lists, not per-sb *dirty* quota lists.  With new lists specifically for
dirty dquots the problem clearly goes away, yes.
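
To make the 0.5*N^2 versus O(N) comparison concrete, here is a toy model
in user-space C. It is not the kernel quota code: the function names, the
array standing in for the in-use list, and the assumption that every entry
starts out dirty are all illustrative. It only counts how many entries get
examined under each strategy.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for an entry on a dedicated dirty list (hypothetical). */
struct dirty_node {
    struct dirty_node *next;
};

/*
 * Strategy 1: one big list, and the scan restarts from the head after
 * every entry that gets synced.  Returns the number of entries examined.
 */
unsigned long sync_restart_from_head(size_t n)
{
    unsigned long examined = 0;
    bool *dirty = malloc((n ? n : 1) * sizeof *dirty);
    size_t i;

    if (dirty == NULL)
        return 0;
    for (i = 0; i < n; i++)
        dirty[i] = true;              /* assume everything is dirty */

    for (;;) {
        for (i = 0; i < n; i++) {
            examined++;
            if (dirty[i])
                break;
        }
        if (i == n)
            break;                    /* a full pass found nothing dirty */
        dirty[i] = false;             /* "sync" it, then restart the scan */
    }
    free(dirty);
    return examined;
}

/*
 * Strategy 2: a list holding only dirty entries.  Take the first entry,
 * sync it, delete it, continue.  Each entry is examined exactly once.
 */
unsigned long sync_dirty_list(size_t n)
{
    unsigned long examined = 0;
    struct dirty_node *nodes = calloc(n ? n : 1, sizeof *nodes);
    struct dirty_node *head = NULL;
    size_t i;

    if (nodes == NULL)
        return 0;
    for (i = 0; i < n; i++) {         /* build the dirty list */
        nodes[i].next = head;
        head = &nodes[i];
    }
    while (head != NULL) {
        examined++;                   /* sync the first entry */
        head = head->next;            /* and drop it from the list */
    }
    free(nodes);
    return examined;
}
```

With 1000 dirty entries the first function examines roughly half a million
entries (the sum 1 + 2 + ... + N plus one final empty pass), while the
second examines exactly 1000.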

Cheers,
 Stephen

> 
> 								Honza
-- 
Stephen C. Tweedie <sct at redhat.com>




From cchan at outblaze.com  Sat Apr 24 00:49:58 2004
From: cchan at outblaze.com (Christopher Chan)
Date: Sat, 24 Apr 2004 08:49:58 +0800
Subject: 2.6.5 and latest Fedora Core 1 kernels cannot handle	files	over
 2.x GB?
In-Reply-To: <1082738929.2100.23.camel@sisko.scot.redhat.com>
References: <408876C1.80601@outblaze.com>	
	<1082719383.2100.9.camel@sisko.scot.redhat.com>	
	<40893D7A.4070706@outblaze.com>
	<1082738929.2100.23.camel@sisko.scot.redhat.com>
Message-ID: <4089B9B6.2000706@outblaze.com>

Stephen C. Tweedie wrote:
> Hi,
> 
> On Fri, 2004-04-23 at 16:59, Christopher Chan wrote:
> 
> 
>>>Your filesystem is corrupt.  You need to run e2fsck to fix it up, and
>>>check the files against a backup.  
>>>
>>>There's not enough information here to begin to diagnose _why_ they are
>>>corrupt, but on 2.4 systems it's bad hardware 99% of the time. 
>>>"memtest86" is usually a good place to start.
>>
>>Thanks.
>>
>>I got the same problem under 2.6.5...same goes I guess? We had no 
>>problems with reiserfs...
> 
> 
> It could be just a one-off event that corrupted it; it could be a memory
> fault that developed after you switched; it could be timing-related. 
> fsck is definitely the first thing to do; memtest86 is always a useful
> next step if there's any suspicion about memory corruption.

This should take the mystery away then. We were all stumped as to why we 
had such a problem at all.

We even created a 12G file on the reiserfs, mounted that file on
loopback and then created an ext3 filesystem there. We tried creating a
large file on that loopback-mounted ext3 fs but it would fail when we
got to around 2.6G.

However we could not replicate this on another box running the same 
Fedora and kernels and with or without a 3ware card.

Thank you for explaining the possible causes.

Christopher




From borise at comcast.net  Fri Apr 23 16:27:11 2004
From: borise at comcast.net (Boris Erl)
Date: Fri, 23 Apr 2004 09:27:11 -0700
Subject: processing writes requests in data=journal,sync mode
Message-ID: <ABEFJEOMBLKCDALICJBNMEMDDAAA.borise@comcast.net>

Hi,

We are currently doing SpecSFS comparison benchmarking to evaluate
advantages of FS journaling to NVRAM card versus journaling to hard disks.
We compare NFS performance for Linux file server with ext3 file system in
"data=journal" mode for three different locations of the file system
journal:
   - inside main file system
   - on a dedicated HD
   - on an NVRAM PCI card
The file system is mounted and exported in "sync" mode.

We assume that ext3 file system works in the following way:
  - synchronously writes data to file system journal
  - acknowledges to the client that NFS operation is completed
  - at some time asynchronously writes data to the main file system

Could somebody confirm or correct our assumption above?

Sincerely,

 Boris Erlikhman
 borise at comcast.net




From adilger at clusterfs.com  Mon Apr 26 15:33:52 2004
From: adilger at clusterfs.com (Andreas Dilger)
Date: Mon, 26 Apr 2004 09:33:52 -0600
Subject: processing writes requests in data=journal,sync mode
In-Reply-To: <ABEFJEOMBLKCDALICJBNMEMDDAAA.borise@comcast.net>
References: <ABEFJEOMBLKCDALICJBNMEMDDAAA.borise@comcast.net>
Message-ID: <20040426153352.GV2938@schnapps.adilger.int>

On Apr 23, 2004  09:27 -0700, Boris Erl wrote:
> We are currently doing SpecSFS comparison benchmarking to evaluate
> advantages of FS journaling to NVRAM card versus journaling to hard disks.
> We compare NFS performance for Linux file server with ext3 file system in
> "data=journal" mode for three different locations of the file system
> journal:
>    - inside main file system
>    - on a dedicated HD
>    - on an NVRAM PCI card
> The file system is mounted and exported in "sync" mode.
> 
> We assume that ext3 file system works in the following way:
>   - synchronously writes data to file system journal
>   - acknowledges to the client that NFS operation is completed
>   - at some time asynchronously writes data to the main file system
> 
> Could somebody confirm or correct our assumption above?

This is almost correct.  For ext3 with sync writers, the data is written to
the journal asynchronously and before the journal is synced the writing
thread reschedules so that any other running threads can also do asynchronous
writes to the journal.  When no more running threads want to write to the
journal then it is finally synced.  Of course, none of the threads doing
sync writes return until after the journal has been synced.

Doing it this way can batch multiple sync writes to the journal instead of
each writer starting and syncing a transaction, and improves performance
if there are multiple writers (usually the case for busy NFS servers).

I believe previous testing found very little difference between NVRAM and
dedicated HD journals, because the IO to the journal is basically linear
writes (no reads or seeks) unless the journal is being recovered.  Having
a larger journal (mke2fs -J size=400) can make a noticeable difference if
there are lots of clients writing because if the journal gets too full you
have to wait for it to flush before resuming transactions.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/




From adilger at clusterfs.com  Thu Apr 29 17:09:22 2004
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 29 Apr 2004 11:09:22 -0600
Subject: Ext3 problems (aborting journal).
In-Reply-To: <200404291415.46949.ender@debian.org>
References: <200404291415.46949.ender@debian.org>
Message-ID: <20040429170922.GN1521@schnapps.adilger.int>

On Apr 29, 2004  14:15 +0200, David Martínez Moreno wrote:
> Hello all. I'm writing to all the people in charge of ext3 fs
> 
> Apr 29 12:21:21 arsinoe kernel: EXT3-fs error (device sda7): ext3_free_blocks: Freeing blocks not in datazone - block = 1071716394, count = 1

You need to run "e2fsck -f /dev/sda7" on the unmounted filesystem.  There
is some sort of corruption there.

> Apr 23 20:35:41 arsinoe kernel: EXT3-fs error (device sda7): ext3_free_blocks: Freeing blocks not in datazone - block = 1075532092, count = 1

This earlier error should have forced a full fsck - did that run?

> Apr 23 20:38:47 arsinoe kernel: i91u: Reset SCSI Bus ...
> Apr 23 20:38:47 arsinoe kernel: ERROR: SCSI host `INI9100U' has no error handling
> Apr 23 20:38:47 arsinoe kernel: ERROR: This is not a safe way to run your SCSI host
> Apr 23 20:38:47 arsinoe kernel: ERROR: The error handling must be added to this driver

This seems a bit ominous, not sure how bad it really is.

> 	I forced to fsck all the ext3 drives (/dev/sda{1,6,7}) and installed 2.6.6-rc2.
> It fsck'ed one of the partitions, then wanted to reboot, then fsck'ed the three

Hmm, so it did run.  It seems you are getting corruption on the disk for
some reason.

> 	A tune2fs from the affected partition:
> 
> arsinoe:/usr/src/dev# tune2fs -l /dev/sda7
> tune2fs 1.35-WIP (07-Dec-2003)
> Filesystem volume name:   <none>
> Last mounted on:          <not available>
> Filesystem UUID:          6b9d38e7-7487-444b-b8e4-68404673964f
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal filetype needs_recovery sparse_super
> Default mount options:    (none)
> Filesystem state:         clean with errors

Was this after the e2fsck was run?  It shouldn't be marked "with errors".

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/




From lists-ext3-users at bruce-guenter.dyndns.org  Thu Apr 29 19:10:33 2004
From: lists-ext3-users at bruce-guenter.dyndns.org (Bruce Guenter)
Date: Thu, 29 Apr 2004 13:10:33 -0600
Subject: Transaction ordering
Message-ID: <20040429191033.GA4675@em.ca>

Greetings.

If I issue the following sequence of pseudo-syscalls:

	fd = open(temp_file, O_WRONLY)
	write(fd)
	rename(temp_file, dest_file)
	fsync(fd)
	close(fd)
(where dest_file is in a different directory)

Does ext3 order the commit such that the file write effectively happens
in the journal before the rename?  That is, is there any chance that, if
a crash occurred, that the destination directory would contain a link to
an incompletely written file?

Thanks.
-- 
Bruce Guenter <bruceg at em.ca> http://em.ca/~bruceg/ http://untroubled.org/
OpenPGP key: 699980E8 / D0B7 C8DD 365D A395 29DA  2E2A E96F B2DC 6999 80E8

From adilger at clusterfs.com  Thu Apr 29 19:53:45 2004
From: adilger at clusterfs.com (Andreas Dilger)
Date: Thu, 29 Apr 2004 13:53:45 -0600
Subject: Transaction ordering
In-Reply-To: <20040429191033.GA4675@em.ca>
References: <20040429191033.GA4675@em.ca>
Message-ID: <20040429195345.GS1521@schnapps.adilger.int>

On Apr 29, 2004  13:10 -0600, Bruce Guenter wrote:
> If I issue the following sequence of pseudo-syscalls:
> 
> 	fd = open(temp_file, O_WRONLY)
> 	write(fd)
> 	rename(temp_file, dest_file)
> 	fsync(fd)
> 	close(fd)
> (where dest_file is in a different directory)
> 
> Does ext3 order the commit such that the file write effectively happens
> in the journal before the rename?  That is, is there any chance that, if
> a crash occurred, that the destination directory would contain a link to
> an incompletely written file?

If you require such ordering, put the fsync before the rename.
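
As a rough illustration of that reordering, here is a minimal sketch in C,
assuming POSIX open/write/fsync/rename. The function name and its
parameters are invented for the example, and error handling is kept
deliberately simple (for instance, an interrupted write is treated as a
failure rather than retried).

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write len bytes to temp_path, force them to stable storage with
 * fsync(), and only then rename the file to dest_path.
 * Returns 0 on success, -1 on error.  (Hypothetical helper.)
 */
int write_then_rename(const char *temp_path, const char *dest_path,
                      const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(temp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            goto fail;
        p += n;
        len -= (size_t)n;
    }

    /* The data must be on disk before the new name can point at it. */
    if (fsync(fd) != 0)
        goto fail;

    if (close(fd) != 0) {
        unlink(temp_path);
        return -1;
    }

    /* Only now make the file visible under its final name. */
    if (rename(temp_path, dest_path) != 0) {
        unlink(temp_path);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(temp_path);
    return -1;
}
```

The sketch only covers the ordering asked about here: the file contents
are synced before the rename is issued. Whether the rename itself has
reached disk by the time the function returns is a separate question that
it does not try to answer.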

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/




From guolin at alexa.com  Fri Apr 30 17:54:59 2004
From: guolin at alexa.com (Guolin Cheng)
Date: Fri, 30 Apr 2004 10:54:59 -0700
Subject: disk problems or false alarm??
Message-ID: <41089CB27BD8D24E8385C8003EDAF7ABBA4914@karl.alexa.com>

Hi, 

 

 I run hundreds of Red Hat 8.0 and Fedora Core 1 boxes, and machines
running both operating systems report disk errors like the following
(collected from /var/log/messages on each Linux box by my own script).
Running "badblocks" on some of the affected hard drives finds failed
sectors, while on others it finds none, which looks like a false
positive. Can anyone give me suggestions or hints?

 

Thanks a lot.

 

......

Host:       arc242

arc242:     Apr 29 13:51:32 arc242 kernel: hdb: dma_intr: status=0x51 {
DriveReady SeekComplete Error }

arc242:     Apr 29 13:51:32 arc242 kernel: hdb: dma_intr: error=0x01 {
AddrMarkNotFound }, LBAsect=38613129, sector=38613064

arc242:     Apr 29 13:51:36 arc242 kernel: hdb: dma_intr: status=0x51 {
DriveReady SeekComplete Error }

arc242:     Apr 29 13:51:36 arc242 kernel: hdb: dma_intr: error=0x01 {
AddrMarkNotFound }, LBAsect=38613129, sector=38613064

arc242:     Apr 29 13:51:43 arc242 kernel: hdb: dma_intr: status=0x51 {
DriveReady SeekComplete Error }

arc242:     Apr 29 13:51:43 arc242 kernel: hdb: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=38613129, sector=38613064

arc242:     Apr 29 13:51:43 arc242 kernel: end_request: I/O error, dev
03:41 (hdb), sector 38613064

arc242:     Apr 29 13:51:49 arc242 kernel: hdb: dma_intr: status=0x51 {
DriveReady SeekComplete Error }

arc242:     Apr 29 13:51:49 arc242 kernel: hdb: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=38613129, sector=38613064

arc242:     Apr 29 13:51:49 arc242 kernel: end_request: I/O error, dev
03:41 (hdb), sector 38613064

 

Host:       arc292

arc292:     Apr 29 04:02:27 arc292 kernel: hda: dma_intr: status=0x51 {
DriveReady SeekComplete Error }

arc292:     Apr 29 04:02:27 arc292 kernel: hda: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=239379157, high=14, low=4498133,
sector=331888

arc292:     Apr 29 04:02:27 arc292 kernel: end_request: I/O error, dev
03:0b (hda), sector 331888

arc292:     Apr 29 04:02:29 arc292 kernel: hda: dma_intr: status=0x51 {
DriveReady SeekComplete Error }

arc292:     Apr 29 04:02:29 arc292 kernel: hda: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=239379157, high=14, low=4498133,
sector=331888

arc292:     Apr 29 04:02:29 arc292 kernel: end_request: I/O error, dev
03:0b (hda), sector 331888

 

...... blahblah...

 

I ran "badblocks" on the boxes to test whether there are real
hardware problems. Some of them really do report problems, and
some of them do NOT. Does anyone know why?

 

 

[root at arc242 root]# badblocks -s -v -n -b 512 -c 4096 /dev/hdb 38620000
38600000

Checking for bad blocks in non-destructive read-write mode

From block 38600000 to 38620000

Checking for bad blocks (non-destructive read-write test)

Testing with random pattern: done                        

Pass completed, 0 bad blocks found.

[root at arc242 root]#

 

[root at arc292 root]#  badblocks -s -v -n -b 512 -c 4096 /dev/hda
239400000 239300000

Checking for bad blocks in non-destructive read-write mode

From block 239300000 to 239400000

Checking for bad blocks (non-destructive read-write test)

Testing with random pattern: 239379104/239400000

239379105

done                        

Pass completed, 2 bad blocks found.

[root at arc292 root]# 

[root at arc292 root]#
