From zach.brown at oracle.com Wed Jan 2 20:42:19 2008 From: zach.brown at oracle.com (Zach Brown) Date: Wed, 02 Jan 2008 12:42:19 -0800 Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66) In-Reply-To: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> Message-ID: <477BF72B.4000608@oracle.com> Erez Zadok wrote: > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree. > Kernel w/ SMP, preemption, and lockdep configured. This is a real lock ordering problem. Thanks for reporting it. The updating of atime inside sys_mmap() orders the mmap_sem in the vfs outside of the journal handle in ext3's inode dirtying: > -> #1 (jbd_handle){--..}: > [] __lock_acquire+0x9cc/0xb95 > [] lock_acquire+0x5f/0x78 > [] journal_start+0xee/0xf8 > [] ext3_journal_start_sb+0x48/0x4a > [] ext3_dirty_inode+0x27/0x6c > [] __mark_inode_dirty+0x29/0x144 > [] touch_atime+0xb7/0xbc > [] generic_file_mmap+0x2d/0x42 > [] mmap_region+0x1e6/0x3b4 > [] do_mmap_pgoff+0x1fb/0x253 > [] sys_mmap2+0x9b/0xb5 > [] syscall_call+0x7/0xb > [] 0xffffffff ext3_direct_IO() orders the journal handle outside of the mmap_sem that dio_get_page() acquires to pin pages with get_user_pages(): > -> #0 (&mm->mmap_sem){----}: > [] __lock_acquire+0x8bc/0xb95 > [] lock_acquire+0x5f/0x78 > [] down_read+0x3a/0x4c > [] dio_get_page+0x4e/0x15d > [] __blockdev_direct_IO+0x431/0xa81 > [] ext3_direct_IO+0x10c/0x1a1 > [] generic_file_direct_IO+0x124/0x139 > [] generic_file_direct_write+0x56/0x11c > [] __generic_file_aio_write_nolock+0x33d/0x489 > [] generic_file_aio_write+0x58/0xb6 > [] ext3_file_write+0x27/0x99 > [] do_sync_write+0xc5/0x102 > [] vfs_write+0x90/0x119 > [] sys_write+0x3d/0x61 > [] sysenter_past_esp+0x5f/0xa5 > [] 0xffffffff Two fixes come to mind: 1) use something like Peter's ->mmap_prepare() to update atime before acquiring the mmap_sem. ( http://lkml.org/lkml/2007/11/11/97 ). 
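[Archive editor's note: the two traces above form the classic AB-BA pattern lockdep exists to catch. As a toy illustration only (user-space Python, not kernel code; the lock names merely mirror the ones in the traces), a lockdep-style ordering checker might look like this:]

```python
# Toy sketch of the AB-BA inversion lockdep reports above. This is
# user-space illustration, not kernel code; "mmap_sem"/"jbd_handle"
# are just labels mirroring the quoted traces.

class OrderChecker:
    """Record which locks were held when each lock was taken; flag
    any pair ever seen in both orders (a potential deadlock)."""
    def __init__(self):
        self.held = []          # locks currently held, in order
        self.before = set()     # (a, b): a was held while taking b

    def acquire(self, lock):
        for h in self.held:
            if (lock, h) in self.before:
                raise RuntimeError(
                    f"possible deadlock: {h} -> {lock} inverts {lock} -> {h}")
            self.before.add((h, lock))
        self.held.append(lock)

    def release(self, lock):
        self.held.remove(lock)

chk = OrderChecker()

# Path 1: sys_mmap() -> touch_atime() -> journal_start()
chk.acquire("mmap_sem"); chk.acquire("jbd_handle")
chk.release("jbd_handle"); chk.release("mmap_sem")

# Path 2: ext3_direct_IO() holds the journal handle, then
# dio_get_page() takes mmap_sem -- the inversion.
chk.acquire("jbd_handle")
try:
    chk.acquire("mmap_sem")
except RuntimeError as e:
    print(e)
```

Path 1 records the ordering mmap_sem -> jbd_handle; path 2 then attempts jbd_handle -> mmap_sem and trips the check, which is essentially the report quoted above.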
I don't know if this would leave more paths which do a journal_start() while holding the mmap_sem. 2) rework ext3's dio to only hold the jbd handle in ext3_get_block(). Chris has a patch for this kicking around somewhere but I'm told it has problems exposing old blocks in ordered data mode. Does anyone have preferences? I could go either way. I certainly don't like the idea of journal handles being held across the entirety of fs/direct-io.c. It's yet another case of O_DIRECT differing wildly from the buffered path :(. - z From fasihullah.askiri at gmail.com Thu Jan 3 10:30:22 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Thu, 3 Jan 2008 16:00:22 +0530 Subject: read() on a deleted file Message-ID: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> Hi all I have a doubt regarding the behaviour of read() on an ext3 filesystem. To elucidate my doubts, I wrote a small program that opens a file, reads one byte at a time, and sleeps for a while between reads. I deleted the file while the read was still in progress and noticed that the read still succeeds. How does this work? Does the kernel not free the inode when the file is deleted but there is a pending read? To check this, instead of deleting, I tried shred-ding the file; the read still gets the correct data. My questions: - Where does the kernel get the data from? - Is this a documented feature which I can use? - Does shred overwrite the file's inode with junk? Thanks for your patience -- Keep Running.... And Relish the run... +Fasih From alex at alex.org.uk Thu Jan 3 10:49:24 2008 From: alex at alex.org.uk (Alex Bligh) Date: Thu, 03 Jan 2008 10:49:24 +0000 Subject: read() on a deleted file In-Reply-To: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> Message-ID: --On 3 January 2008 16:00:22 +0530 Fasihullah Askiri wrote: > I have a doubt regarding the behaviour of read() on an ext3 > filesystem.
To elucidate my doubts, I wrote a small program opens a > file and reads one byte at a time and sleeps for a while. I deleted > the file while the read was still in progress and I noticed that the > read still succeeds. How does this work? Does the kernel not free the > inode when the file is deleted but there is a pending read? To check > this, instead of deleting, I tried shred-ding the file, the read still > gets the correct data. That's standard UNIX behaviour. The file exists on disk until all references to it have disappeared (references including the open file handle). All you do by typing "rm" is delete a reference/link to it from a particular directory, not (necessarily) delete the file. That's why the system call is called "unlink". Alex From fasihullah.askiri at gmail.com Thu Jan 3 11:12:40 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Thu, 3 Jan 2008 16:42:40 +0530 Subject: read() on a deleted file In-Reply-To: References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> Message-ID: <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> Thanx for the response. That is why I tried shred-ding the file. I believe that shred overwrites the file inode, if so, shred should have led to failures of read() which is not the case. How does that happen? On Jan 3, 2008 4:19 PM, Alex Bligh wrote: > > > --On 3 January 2008 16:00:22 +0530 Fasihullah Askiri > wrote: > > > I have a doubt regarding the behaviour of read() on an ext3 > > filesystem. To elucidate my doubts, I wrote a small program opens a > > file and reads one byte at a time and sleeps for a while. I deleted > > the file while the read was still in progress and I noticed that the > > read still succeeds. How does this work? Does the kernel not free the > > inode when the file is deleted but there is a pending read? To check > > this, instead of deleting, I tried shred-ding the file, the read still > > gets the correct data. > > That's standard UNIX behaviour. 
The file exists on disk until all > references to it have disappeared (references including the open > file handle). All you do by typing "rm" is delete a reference/link to > it from a particular directory, not (necessarily) delete the file. > That's why the system call is called "unlink". > > Alex > -- Keep Running.... And Relish the run... +Fasih From liuyue at ncic.ac.cn Thu Jan 3 11:20:37 2008 From: liuyue at ncic.ac.cn (liuyue) Date: Thu, 3 Jan 2008 19:20:37 +0800 Subject: ext3 peformance problem Message-ID: <20080103111526.B22051368B1@ncic.ac.cn> After doing some tests, I think I have found out the reasons. The read/write performance of a hard disk is not homogeneous. The beginning of the disk is laid out on the outer cylinders, farthest from the center, while the end of the disk is on the cylinders closest to the center. Because the rotation speed is constant and the linear information density is constant, the raw performance of the disk is not the same for all cylinders. The degradation can be up to 50% when comparing performance at the beginning of the disk with performance at the end. I created a small partition on the disk (using the first 2000 of the 17000 cylinders on the disk), mounted an ext3 file system on it, and tested the performance of different directories. The performances under different directories are nearly the same :) ======= 2008-01-03 19:06:11 ======= >hello all, > > I am testing ext3 file system recently but find some problem > I use GreatTurbo Enterprise Server 10 (Zuma) and 2.6.20 kernel > I conducted my test as follows: > > mkfs.ext3 /dev/sdb1 > mount /dev/sdb1 /mnt/test > cd /mnt/test > mkdir 0 1 2 3 4 5 6 7 > > I test write and read performance under different subdirs and give the performance result. I also use filefrag to see the file layout.
> >Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/0/tmpfile -c -e -+n -w > 5242880 1024 72706 0 80474 0 > /mnt/test/0/tmpfile: 44 extents found, perfection would be 41 extents > >Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/1/tmpfile -c -e -+n -w > 5242880 1024 49957 0 52899 0 >/mnt/test/1/tmpfile: 42 extents found, perfection would be 41 extents > >Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/2/tmpfile -c -e -+n -w > 5242880 1024 60292 0 64664 0 > /mnt/test/2/tmpfile: 42 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/3/tmpfile -c -e -+n -w > 5242880 1024 70540 0 78644 0 > /mnt/test/3/tmpfile: 46 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/4/tmpfile -c -e -+n -w > 5242880 1024 61334 0 67778 0 > /mnt/test/4/tmpfile: 44 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile -c -e -+n -w > 5242880 1024 66735 0 75114 0 > /mnt/test/5/tmpfile: 42 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/6/tmpfile -c -e -+n -w > 5242880 1024 65062 0 72686 0 > /mnt/test/6/tmpfile: 44 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/7/tmpfile -c -e -+n -w > 5242880 1024 69247 0 78563 0 > /mnt/test/7/tmpfile: 45 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/tmpfile -c -e -+n -w > 5242880 1024 77085 0 81696 0 > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile2 -c -e -+n -w > /mnt/test/5/tmpfile2: 48 extents found, perfection would be 41 extents > 5242880 1024 57776 0 64870 0 > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile3 -c -e -+n -w > 5242880 
1024 54799 0 59145 0 > /mnt/test/5/tmpfile3: 44 extents found, perfection would be 41 extents > > My questions are: >1. why the performances under different subdirs varies so much? In /mnt/test/0 the performance is 72/80, while in /mnt/test/1 the performance is 49/53 >2. I see that the extents of all files are nearly the same, but their performances are different. What are the other factors that influence the performance except for the extents(fragmentation) of the file? >3. Is it true that the more files there already exists in a dir, the lower performance we will get if we test under the dir? >as in my test, the performance of /mnt/test/5/tmpfile is 66/75, while the performances of /mnt/test/5/tmpfile2 and tmpfile3 are 57/64 54/59 > >Thanks very much > > > > = = = = = = = = = = = = = = = = = = = = From: liuyue <liuyue at ncic.ac.cn> Date: 2008-01-03 From alex at alex.org.uk Thu Jan 3 11:31:42 2008 From: alex at alex.org.uk (Alex Bligh) Date: Thu, 03 Jan 2008 11:31:42 +0000 Subject: read() on a deleted file In-Reply-To: <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> Message-ID: --On 3 January 2008 16:42:40 +0530 Fasihullah Askiri wrote: > Thanx for the response. That is why I tried shred-ding the file. I > believe that shred overwrites the file inode, if so, shred should have > led to failures of read() which is not the case. How does that happen? Buffering / caching of reads.
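[Archive editor's note: the stale reads Alex alludes to can be reproduced entirely in user space. A minimal sketch, assuming Python, whose buffered file objects behave like stdio's fread() buffering here:]

```python
# Demonstrate stale reads from a user-space read buffer: the first
# 1-byte read pulls a whole buffer's worth of the file into memory,
# so overwriting the file on disk afterwards is invisible to this
# still-buffered reader -- the effect seen with fread() after shred.
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"A" * 100)
os.close(fd)

reader = open(path, "rb")       # buffered, like stdio's fread()
first = reader.read(1)          # fills the buffer with the whole file

with open(path, "r+b") as f:    # "shred" the contents in place
    f.write(b"B" * 100)
    f.flush()
    os.fsync(f.fileno())

rest = reader.read()            # served from the stale buffer, not the disk
reader.close()
os.unlink(path)

print(first, rest[:4])
```

Even though the bytes on disk are now all "B", the buffered reader keeps returning the "A" bytes it read ahead before the overwrite.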
Alex From fasihullah.askiri at gmail.com Thu Jan 3 12:05:03 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Thu, 3 Jan 2008 17:35:03 +0530 Subject: read() on a deleted file In-Reply-To: <1199364407.2930.4.camel@alon> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> <1199364407.2930.4.camel@alon> Message-ID: <80cd17810801030405o333d2bcdkb93d831a1d4640@mail.gmail.com> What I meant was, instead of deleting, I tried shredding the file. The result was still consistent reads. However, after the mail from Alex, I increased the filesize to see how much it caches. Turns out that on my system, the read starts returning junk data [that was written by shred] after reading 1040 bytes correctly. This is what I understand now: if I delete the file, the kernel guarantees that the file data is preserved till the last reference (in the form of an open filehandle, maybe) lingers. If I shred the file, the read succeeds only until the buffered data runs out. This, however, sounds weird to me: what we are essentially saying is that open/read might not return the latest data! AFAIK the buffer cache/inode cache that the kernel maintains is refreshed as soon as the file is modified. Please clarify. Thanks again for the responses. On Jan 3, 2008 6:16 PM, Hayim Shaul wrote: > On Thu, 2008-01-03 at 16:42 +0530, Fasihullah Askiri wrote: > > Thanx for the response. That is why I tried shred-ding the file. I > > believe that shred overwrites the file inode, if so, shred should have > > led to failures of read() which is not the case. How does that happen? > > > > What do you mean by re-writing? > Do you mean opening a new file with the same name and writing into it? > > i don't think the new file (necessarily) gets the same inode as the file > you deleted. > More specifically, while the inode of the "deleted" file still exists, > the new inode would most likely to be different. > > -- Keep Running....
And Relish the run... +Fasih From ling at fnal.gov Thu Jan 3 15:51:07 2008 From: ling at fnal.gov (Ling C. Ho) Date: Thu, 03 Jan 2008 09:51:07 -0600 Subject: ext3 peformance problem In-Reply-To: <20080103111526.B22051368B1@ncic.ac.cn> References: <20080103111526.B22051368B1@ncic.ac.cn> Message-ID: <477D046B.1050400@fnal.gov> I find the "oldalloc" option helpful when doing tests like this, even if you are writing to a single huge directory. Files/dirs will always be written close to each other on the disk physically starting from the beginning of the disk. ... ling liuyue wrote: > After doing some tests, I think I have found out the reasons. > > The read/write performance of an hard disk is not homogenous. The beginning of the disk is stored on the further cylinders from the center, while the end of the disk is stored on the cylinders close to the center. Because the disk rotation speed is constant, and the information density is constant, the raw > performance of the disk is not the same for all cylinders. The performance degradation can be up to 50% when comparing performances at > the beginning of the disk and at the end of the disk. > > I creat a small partition on the disk (using the first 2000 cylinders of the total 17000 cylinders on the disk), mount ext3 file system on it and test the performances of different directories. The performances under different directories are nearly the same :) > > > ======= 2008-01-03 19:06:11 ????????======= > > >> ext3-usershello all, >> >> I am testing ext3 file system recently but find some problem >> I use GreatTurbo Enterprise Server 10 (Zuma) and 2.6.20 kernel >> I conducted my test as follows: >> >> mkfs.ext3 /dev/sdb1 >> mount /dev/sdb1 /mnt/test >> cd /mnt/test >> mkdir 0 1 2 3 4 5 6 7 >> >> I test write and read performance under different subdirs and give the performance result. I also use filefrag to see the file layout. 
>> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/0/tmpfile -c -e -+n -w >> 5242880 1024 72706 0 80474 0 >> /mnt/test/0/tmpfile: 44 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/1/tmpfile -c -e -+n -w >> 5242880 1024 49957 0 52899 0 >> /mnt/test/1/tmpfile: 42 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/2/tmpfile -c -e -+n -w >> 5242880 1024 60292 0 64664 0 >> /mnt/test/2/tmpfile: 42 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/3/tmpfile -c -e -+n -w >> 5242880 1024 70540 0 78644 0 >> /mnt/test/3/tmpfile: 46 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/4/tmpfile -c -e -+n -w >> 5242880 1024 61334 0 67778 0 >> /mnt/test/4/tmpfile: 44 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile -c -e -+n -w >> 5242880 1024 66735 0 75114 0 >> /mnt/test/5/tmpfile: 42 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/6/tmpfile -c -e -+n -w >> 5242880 1024 65062 0 72686 0 >> /mnt/test/6/tmpfile: 44 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/7/tmpfile -c -e -+n -w >> 5242880 1024 69247 0 78563 0 >> /mnt/test/7/tmpfile: 45 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/tmpfile -c -e -+n -w >> 5242880 1024 77085 0 81696 0 >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile2 -c -e -+n -w >> /mnt/test/5/tmpfile2: 48 extents found, perfection would be 41 extents >> 5242880 1024 57776 0 64870 0 >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f 
/mnt/test/5/tmpfile3 -c -e -+n -w >> 5242880 1024 54799 0 59145 0 >> /mnt/test/5/tmpfile3: 44 extents found, perfection would be 41 extents >> >> My questions are: >> 1. why the performances under different subdirs varies so much? In /mnt/test/0 the performance is 72/80, while in /mnt/test/1 the performance is 49/53 >> 2. I see that the extents of all files are nearly the same, but their performances are different. What are the other factors that influence the performance except for the extents(fragmentation) of the file? >> 3. Is it true that the more files there already exists in a dir, the lower performance we will get if we test under the dir? >> as in my test, the performance of /mnt/test/5/tmpfile is 66/75, while the performances of /mnt/test/5/tmpfile2 and tmpfile3 are 57/64 54/59 >> >> Thanks very much >> >> > > = = = = = = = = = = = = = = = = = = = = > From: liuyue <liuyue at ncic.ac.cn> > Date: 2008-01-03 > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > > From davids at webmaster.com Thu Jan 3 20:45:40 2008 From: davids at webmaster.com (David Schwartz) Date: Thu, 3 Jan 2008 12:45:40 -0800 Subject: read() on a deleted file In-Reply-To: <80cd17810801030405o333d2bcdkb93d831a1d4640@mail.gmail.com> Message-ID: > This is what I understand now, if I delete the file, the kernel > guarantees that the file data is preserved till the last reference (in > the form of an open filehandle maybe) lingers. If I shred the file, > the read succeeds till the buffering is done. Actually, you can't delete a file while there are references to it. You can remove it from its directory, which reduces the reference count by one, but that's it. That's why the system call in UNIX is called "unlink" rather than "delete". A file is automatically deleted when its reference count goes to zero.
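[Archive editor's note: the reference-count behaviour David describes is easy to demonstrate from user space. A sketch, assuming Linux and Python's unbuffered os.read(), so stdio-style buffering stays out of the picture:]

```python
# Unlink a file while a descriptor is still open: the directory entry
# goes away immediately, but the inode and its data survive until the
# last reference (the open fd) is dropped.
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"still here")
os.close(fd)

fd = os.open(path, os.O_RDONLY)   # take a reference via an open fd
os.unlink(path)                   # remove the only directory link

assert not os.path.exists(path)   # the name is gone...
data = os.read(fd, 100)           # ...yet the data is still readable
os.close(fd)                      # last reference dropped: inode freed

print(data)
```

The os.read() succeeds and returns the original contents even though the pathname no longer resolves.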
Putting a file in a directory adds one to its reference count. Opening a file adds one. > This, however sounds wierd to me, what we are essentially saying is > that the open/read might not return the latest data!!!! AFAIK the > buffer cache/inode cache that the kernel maintains is refreshed as > soon the file is modified. Please clarify. It's impossible to clarify unless you tell us more precisely what you are doing. For example, you use the term "shred", but that can mean way more than one thing. Also, when you talk about "reading" a file, that could mean the "read" system call, but it could also mean the "fread" library function. DS From fasihullah.askiri at gmail.com Fri Jan 4 06:08:53 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Fri, 4 Jan 2008 11:38:53 +0530 Subject: read() on a deleted file In-Reply-To: References: <80cd17810801030405o333d2bcdkb93d831a1d4640@mail.gmail.com> Message-ID: <80cd17810801032208g78509444q9d37162361c68156@mail.gmail.com> Hi Sorry for the confusion caused. I just realized that I was using fread and not read. By "shred", I meant the /usr/bin/shred program which overwrites a file with junk. I was getting stale results because of the buffering in fread. Thanks again for the responses. On Jan 4, 2008 2:15 AM, David Schwartz wrote: > > This is what I understand now, if I delete the file, the kernel > > guarantees that the file data is preserved till the last reference (in > > the form of an open filehandle maybe) lingers. If I shred the file, > > the read succeeds till the buffering is done. > > Actually, you can't delete a file while there are references to it. You can > remove it from its directory, which reduces the reference count by one, but > that's it. That's why the system call in UNIX is called "unlink" rather than > "delete". > > A file is automatically deleted when its reference count goes to zero. > Putting a file in a directory adds one to its reference count. Opening a > file adds one.
> > > This, however sounds wierd to me, what we are essentially saying is > > that the open/read might not return the latest data!!!! AFAIK the > > buffer cache/inode cache that the kernel maintains is refreshed as > > soon the file is modified. Please clarify. > > It's impossible to clarify unless you tell us more precisely what you are > doing. For example, you use the term "shred", but that can mean way more > than one thing. Also, when you talk about "reading" a file, that could mean > the "read" system call, but it could also mean the "fread" library function. > > DS > > > -- Keep Running.... And Relish the run... +Fasih From evoltech at 2inches.com Sat Jan 5 05:28:33 2008 From: evoltech at 2inches.com (Dennison Williams) Date: Fri, 04 Jan 2008 21:28:33 -0800 Subject: ext3 filesystem is not recognized Message-ID: <477F1581.6070003@2inches.com> Hello all, I have a few ext3 file systems that are not being recognized. Here is the setup: MD software RAID 5 on 4 disks (md0), an LVM logical volume (/dev/volume_group/logical_volume) comprised of one physical device (/dev/md0), an encryption layer provided by the cryptoloop driver (losetup -e aes /dev/loop0 /dev/volume_group/logical_volume), then an ext3 file system (mkfs.ext3 /dev/loop0). Recently the RAID device kicked out one of the disks during a large file transfer. After re-adding the disk to the array (smartctl didn't report anything wrong with it, so I am not sure why this happened), authenticating against the cryptographic layer, and then trying to mount the drive, I get the following error: [root at storage redhat]# mount -t ext3 /dev/loop1 /terrorbyte/1/ mount: wrong fs type, bad option, bad superblock on /dev/loop1, The message in /var/log/messages is: VFS: Can't find ext3 filesystem on dev loop1. I then tried running e2fsck on /dev/loop1 with each of the backup superblock locations reported by "mke2fs -n /dev/loop1", still with no luck.
I am unsure of where the problem actually is, and how to go about debugging it. Any suggestions would be appreciated. Sincerely, Dennison Williams -- ***************************************************************** * To communicate with me securely, please email me and I will * * send you my public key. We can then verify each other's * * fingerprints in person, or over the phone. * * * * I am open and willing to talk about setting up PGP, the * * security problems inherent with PGP, and alternatives to PGP * * for secure electronic communication. * ***************************************************************** From darkonc at gmail.com Sat Jan 5 07:06:19 2008 From: darkonc at gmail.com (Stephen Samuel) Date: Fri, 4 Jan 2008 23:06:19 -0800 Subject: ext3 filesystem is not recognized In-Reply-To: <477F21BA.3040007@2inches.com> References: <477F1581.6070003@2inches.com> <6cd50f9f0801042201r16977affw5ec3804ea580c8e2@mail.gmail.com> <477F21BA.3040007@2inches.com> Message-ID: <6cd50f9f0801042306h5a800d62h76a33dc8a66f1678@mail.gmail.com> Given what you've described, the only drive that it would make sense to pull out would be the one that was dropped and then re-inserted. On Jan 4, 2008 10:20 PM, Dennison Williams wrote: > > Did you try and re-insert the kicked-out drive as if it was clean, or > did > > you try to re-sync it to the existing filesystem? If the former, then > > that's a HUGE mistake because the data on the drive is no longer in sync > > with what is on the other drives. (unless the entire filesystem was made > > read-only when (or before) the drive was dropped out.) > > I re-inserted it with: > mdadm /dev/md0 --add /dev/sde > At which point it seemed to resync with the raid device (ie. the output > of /proc/mdstat showed that it was incrementally syncing) > > > Check the SMART logs for each of the drives to see if they've had any > > problems.
> > there are messages like this: > /dev/sdc, failed to read SMART Attribute Data > ...but this wasn't one of the disks that was removed from the raid device If there are complaints about sdc, then I'd be inclined to do a long SMART self-test of it. It's possible that the real problem started here. A badblock read test (or just a dd if=/dev/sdc of=/dev/null) would also test the I/O path between the drive and the CPU. If there are complaints about that drive, then, at this point, you should consider it suspicious. > > Try pulling the (candidate) compromised drive out of the array and see if > the (degraded) filesystem works OK and has good data. If it does, then I'd > guess that the pulled drive had bad data written to it somehow --- re-add it > (as if it was hot-swapped in), and hope it doesn't happen again. > Try that with each of the drives, in turn until you find the badly written > drive. If one of the drives has badly written data, the system really can't > tell, for sure, which one is wrong. > I want to make sure I understand you here. Say my raid device is > comprised of four devices /dev/md0 = /dev/sd[abcd], are you suggesting > that for each drive I do something like this: > > mdadm /dev/md0 --fail /dev/sda --remove /dev/sda Don't bother. If the drive got resynced, then pulling it won't do any good unless software RAID gets silently confused by random data on one plex. > > > then try to mount up the FS as usual to see if it is there? Wouldn't > this point be moot if the device already re-assembled itself? > Yes, it would be moot. > > > > > [[ unless the array was read-only when the drive was dropped, then you > will > only have any hope of good data with the dropped drive pulled ]] > > It wasn't read-only, but nothing was writing to it. > > Thanks for your time and prompt response. > Sincerely, > Dennison Williams > Unless noatime was set, the drive was being written to (if only with atime data).
If all that got scrambled was atime data, you should still have been able to mount the drive. -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From evoltech at 2inches.com Sun Jan 6 08:15:26 2008 From: evoltech at 2inches.com (Dennison Williams) Date: Sun, 06 Jan 2008 00:15:26 -0800 Subject: ext3 filesystem is not recognized In-Reply-To: <6cd50f9f0801042306h5a800d62h76a33dc8a66f1678@mail.gmail.com> References: <477F1581.6070003@2inches.com> <6cd50f9f0801042201r16977affw5ec3804ea580c8e2@mail.gmail.com> <477F21BA.3040007@2inches.com> <6cd50f9f0801042306h5a800d62h76a33dc8a66f1678@mail.gmail.com> Message-ID: <47808E1E.1030406@2inches.com> Stephen Samuel wrote: > Given what you've described, the only drive that it would make sense to > pull out would be the one that was dropped and then re-inserted. I did this with the following set of commands: mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sdf /dev/sdg /dev/sdh mdadm --run /dev/md1 lvchange -a y /dev/volume_group/logical_volume losetup -e aes /dev/loop1 /dev/volume_group/logical_volume mount -t ext3 -o ro /dev/loop1 /mnt/logical_volume and got the same error: "mount: wrong fs type, bad option, bad superblock on /dev/loop1" >>> Check the SMART logs for each of the drives to see if they've had any >>> problems. >> there are messages like this: >> /dev/sdc, failed to read SMART Attribute Data >> ...but this wasn't one of the disks that was removed from the raid device > > If there are complaints about SDC, then I'd be inclined to do a long test of > it > in smart. it's possible that the real problem started here. > > A badblock read test (or just a dd if=/dev/sdc of=/dev/null) would also test > the I/O path between the drive and the CPU. If there are complaints about > that drive, then .. at this point, you should consider it suspicious. Ran "dd if=/dev/sdc of=/dev/null" while monitoring /var/log/messages, with no messages.
Must have been a fluke. I will try doing an extended run of smartctl. >> Try pulling the (candidate) compromised drive out of the array and see if >> the (degraded) filesystem works OK and has good data. If it does, then > I'd >> guess that the pulled drive had bad data written to it somehow --- re-add > it >> (as if it was hot-swapped in), and hope it doesn't happen again. >> Try that with each of the drives, in turn until you find the badly > written >> drive. If one of the drives has badly written data, the system really > can't >> tell, for sure, which one is wrong. I ended up doing this with each drive as above and still the FS wasn't recognized. One thing that confuses me though is that the data seems to be partially valid. When the array device is assembled and running, the logical volume is recognized, and furthermore losetup accepts the correct password. The only thing that doesn't seem to be in working order is the ext3 filesystem. In the Linux encryption HOWTO (http://encryptionhowto.sourceforge.net/Encryption-HOWTO-6.html, section 6.1), there is an entry describing possible problems if the kernel was compiled without CONFIG_BLK_LOOP_DEV_USE_REL_BLOCK. I can't find this option anywhere in the config for my kernel (2.6.18-1.2798.fc6xen). At this point I am thinking that the problem is at the cryptoloop or ext3 level, but I am not sure what else I can do to check. Any more ideas? Sincerely, Dennison Williams From liuyue at ncic.ac.cn Mon Jan 7 02:55:27 2008 From: liuyue at ncic.ac.cn (liuyue) Date: Mon, 7 Jan 2008 10:55:27 +0800 Subject: ext3 peformance problem Message-ID: <20080107024936.85CB9136935@ncic.ac.cn> It does help !! ======= 2008-01-03 23:51:07 ======= >I find the "oldalloc" option helpful when doing tests like this, even if >you are writing to a single huge directory. Files/dirs will always be >written close to each other on the disk physically starting from the >beginning of the disk. > >...
>ling > >liuyue wrote: >> After doing some tests, I think I have found out the reasons. >> >> The read/write performance of an hard disk is not homogenous. The beginning of the disk is stored on the further cylinders from the center, while the end of the disk is stored on the cylinders close to the center. Because the disk rotation speed is constant, and the information density is constant, the raw >> performance of the disk is not the same for all cylinders. The performance degradation can be up to 50% when comparing performances at >> the beginning of the disk and at the end of the disk. >> >> I creat a small partition on the disk (using the first 2000 cylinders of the total 17000 cylinders on the disk), mount ext3 file system on it and test the performances of different directories. The performances under different directories are nearly the same :) >> >> >> ======= 2008-01-03 19:06:11 ======= >> >> >>> hello all, >>> >>> I am testing ext3 file system recently but find some problem >>> I use GreatTurbo Enterprise Server 10 (Zuma) and 2.6.20 kernel >>> I conducted my test as follows: >>> >>> mkfs.ext3 /dev/sdb1 >>> mount /dev/sdb1 /mnt/test >>> cd /mnt/test >>> mkdir 0 1 2 3 4 5 6 7 >>> >>> I test write and read performance under different subdirs and give the performance result. I also use filefrag to see the file layout.
>>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/0/tmpfile -c -e -+n -w >>> 5242880 1024 72706 0 80474 0 >>> /mnt/test/0/tmpfile: 44 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/1/tmpfile -c -e -+n -w >>> 5242880 1024 49957 0 52899 0 >>> /mnt/test/1/tmpfile: 42 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/2/tmpfile -c -e -+n -w >>> 5242880 1024 60292 0 64664 0 >>> /mnt/test/2/tmpfile: 42 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/3/tmpfile -c -e -+n -w >>> 5242880 1024 70540 0 78644 0 >>> /mnt/test/3/tmpfile: 46 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/4/tmpfile -c -e -+n -w >>> 5242880 1024 61334 0 67778 0 >>> /mnt/test/4/tmpfile: 44 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile -c -e -+n -w >>> 5242880 1024 66735 0 75114 0 >>> /mnt/test/5/tmpfile: 42 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/6/tmpfile -c -e -+n -w >>> 5242880 1024 65062 0 72686 0 >>> /mnt/test/6/tmpfile: 44 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/7/tmpfile -c -e -+n -w >>> 5242880 1024 69247 0 78563 0 >>> /mnt/test/7/tmpfile: 45 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/tmpfile -c -e -+n -w >>> 5242880 1024 77085 0 81696 0 >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile2 -c -e -+n -w >>> /mnt/test/5/tmpfile2: 48 extents found, perfection would be 41 extents >>> 5242880 1024 57776 0 64870 0 >>> >>> Command line 
used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile3 -c -e -+n -w >>> 5242880 1024 54799 0 59145 0 >>> /mnt/test/5/tmpfile3: 44 extents found, perfection would be 41 extents >>> >>> My questions are: >>> 1. Why does the performance under different subdirs vary so much? In /mnt/test/0 the performance is 72/80, while in /mnt/test/1 the performance is 49/53 >>> 2. I see that the extents of all files are nearly the same, but their performances are different. What are the other factors that influence the performance besides the extents (fragmentation) of the file? >>> 3. Is it true that the more files already exist in a dir, the lower the performance we will get if we test under that dir? >>> As in my test, the performance of /mnt/test/5/tmpfile is 66/75, while the performances of /mnt/test/5/tmpfile2 and tmpfile3 are 57/64 and 54/59 >>> >>> Thanks very much >>> >> >> = = = = = = = = = = = = = = = = = = = = >> >> liuyue >> liuyue at ncic.ac.cn >> 2008-01-03 >> >> _______________________________________________ >> Ext3-users mailing list >> Ext3-users at redhat.com >> https://www.redhat.com/mailman/listinfo/ext3-users >> >> = = = = = = = = = = = = = = = = = = = = liuyue liuyue at ncic.ac.cn 2008-01-07 From lakshmipathi.g at gmail.com Mon Jan 7 04:53:46 2008 From: lakshmipathi.g at gmail.com (lakshmi pathi) Date: Mon, 7 Jan 2008 10:23:46 +0530 Subject: How to flush file system buffers? Message-ID: Hi all, I want to know whether there is any system call available to flush all ext3 file system buffers (especially the inode cache) to disk. I tried sync(), but it doesn't seem to work for me. Any thoughts? -Laks -------------- next part -------------- An HTML attachment was scrubbed...
URL: From hayim at iportent.com Thu Jan 3 12:46:47 2008 From: hayim at iportent.com (Hayim Shaul) Date: Thu, 03 Jan 2008 14:46:47 +0200 Subject: read() on a deleted file In-Reply-To: <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> Message-ID: <1199364407.2930.4.camel@alon> On Thu, 2008-01-03 at 16:42 +0530, Fasihullah Askiri wrote: > Thanx for the response. That is why I tried shred-ding the file. I > believe that shred overwrites the file inode, if so, shred should have > led to failures of read() which is not the case. How does that happen? > What do you mean by re-writing? Do you mean opening a new file with the same name and writing into it? I don't think the new file (necessarily) gets the same inode as the file you deleted. More specifically, while the inode of the "deleted" file still exists, the new inode would most likely be different.
> > > > What do you mean by re-writing? > Do you mean opening a new file with the same name and writing into it? > > i don't think the new file (necessarily) gets the same inode as the file > you deleted. > More specifically, while the inode of the "deleted" file still exists, > the new inode would most likely to be different. > > -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From amillionlobsters at gmail.com Fri Jan 11 04:32:37 2008 From: amillionlobsters at gmail.com (Paul d'Aoust) Date: Thu, 10 Jan 2008 20:32:37 -0800 Subject: root inode corrupted; tries to clear and reallocate, but can't Message-ID: <78dcc8220801102032x2ea66985x6ab43286f824d3b5@mail.gmail.com> Hi there. I think I fscked up my filesystem (betcha nobody's used that one before!). I made the mistake of fscking an online ext3 filesystem (guess I wasn't paying attention or I was sick of it being so paranoid or something) and quickly discovered why I'm not supposed to do that. The root inode somehow got corrupted, and a whole bunch of inodes started claiming the same blocks. Here's the result of my attempt to mount: mount: wrong fs type, bad option, bad superblock on /dev/hda1, missing codepage or helper program, or other error. In some cases useful info is found in syslog -- try dmesg | tail or so So 'dmesg' reveals this: EXT3-fs: corrupt root inode, run e2fsck Then, when I run e2fsck, the first thing it says is Root inode is not a directory. Clear? I say 'yes', and then it proceeds to correct and then delete the parent entry for every inode in the root directory (owing to the fact that their parent, inode 2, has just been cleared). Here's the exact wording: Missing '..' in directory inode 5406734. Fix? yes Entry '..' in ... (5406734) has deleted/unused inode 2. Clear? yes Then, in pass 3, when it tries to repair the root inode, it says Root inode not allocated. Allocate? 
yes Error creating root directory (extfs_new_block): Could not allocate block in ext2 filesystem e2fsck: aborted Now, I know I have more than just a couple free blocks, partly because debugfs says so, and partly because I've tried deleting inodes and freeing up blocks. Some I deleted when e2fsck asked me if I wanted to clone or delete the multiply-claimed blocks, and some I deleted by using 'clri' in debugfs. I've tried unallocating the root inode and its block manually, and it still says it can't allocate any block in the filesystem when it tries to rebuild the root inode. If anybody has some insight or suggestions, I would love to hear them! Thanks in advance, Paul d'Aoust From jss at ast.cam.ac.uk Fri Jan 11 11:54:46 2008 From: jss at ast.cam.ac.uk (Jeremy Sanders) Date: Fri, 11 Jan 2008 11:54:46 +0000 Subject: Checksumming layer Message-ID: Is there any sort of checksumming layer that could lie between the disk and ext3, or be implemented as part of ext3/4? We've just had a couple of drives recently where the drive started silently corrupting the data without generating any I/O or SMART errors. This is pretty disastrous as you don't necessarily find out about the corruption until it is too late. I imagine the overhead of such a layer wouldn't be that much. I would pay a few percent performance for knowing that the data is not corrupt. Jeremy -- Jeremy Sanders http://www-xray.ast.cam.ac.uk/~jss/ X-Ray Group, Institute of Astronomy, University of Cambridge, UK. Public Key Server PGP Key ID: E1AAE053 From lists at nerdbynature.de Fri Jan 11 12:26:40 2008 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 11 Jan 2008 13:26:40 +0100 (CET) Subject: Checksumming layer In-Reply-To: References: Message-ID: <50295.62.180.231.196.1200054400.squirrel@housecafe.dyndns.org> On Fri, January 11, 2008 12:54, Jeremy Sanders wrote: > Is there any sort of checksumming layer that could lie between the disk > and ext3, or be implemented as part of ext3/4?
http://www.bullopensource.org/ext4/files/ext4.txt notes: * journal checksumming for robustness, performance (prototype exists) Features like metadata checksumming have been discussed and planned for a bit, but no patches exist yet, so I'm not sure they're in the near-term roadmap. ...but apart from that, only ZFS comes to mind: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Checksums C. -- make bzImage, not war From jprats at cesca.es Fri Jan 11 12:38:21 2008 From: jprats at cesca.es (Jordi Prats) Date: Fri, 11 Jan 2008 13:38:21 +0100 Subject: Checksumming layer In-Reply-To: References: Message-ID: <4787633D.5070905@cesca.es> Hi, You could use tripwire to check all files periodically instead of relying on the file system for that task. (I don't think any file system does this checking right now) Jordi Jeremy Sanders wrote: > Is there any sort of checksumming layer that could lie between the disk and > ext3, or be implemented as part of ext3/4? > > We've just had a couple of drives recently where the drive started silently > corrupting the data without generating any I/O or SMART errors. This is > pretty disastrous as you don't necessarily find out about the corruption > until it is too late. > > I imagine the overhead of such a layer wouldn't be that much. I would pay a > few percent performance for knowing that the data is not corrupt. > > Jeremy > > -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputació de Catalunya Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona T. 93 205 6464 · F. 93 205 6979 · jprats at cesca.es ......................................................................
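Jordi's suggestion (a periodic user-space scan, with no checksumming in the filesystem itself) can be prototyped with nothing more than coreutils. The sketch below is illustrative only: the directory, file names, and messages are made up, and a real deployment would keep the baseline on separate, trusted storage.

```shell
#!/bin/sh
# Record a baseline of per-file SHA-256 checksums, then re-verify it later
# (e.g. from cron) to catch silent corruption. Paths here are illustrative.
set -e
DATA=$(mktemp -d)              # stand-in for the directory being protected
echo "important bits" > "$DATA/file1"

# 1) record the baseline (stored outside the tree it describes)
( cd "$DATA" && find . -type f -print0 | xargs -0 sha256sum ) > "$DATA.baseline"

# 2) later: sha256sum -c exits non-zero if any file no longer matches
if ( cd "$DATA" && sha256sum --quiet -c "$DATA.baseline" ); then
    echo "no silent corruption detected"
else
    echo "checksum mismatch: possible silent corruption"
fi
```

Unlike a checksumming block layer, this only catches corruption between scans, and it cannot tell bit rot apart from a legitimate modification, which is the same limitation the tripwire approach has.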
From jss at ast.cam.ac.uk Fri Jan 11 12:44:31 2008 From: jss at ast.cam.ac.uk (Jeremy Sanders) Date: Fri, 11 Jan 2008 12:44:31 +0000 Subject: Checksumming layer References: <4787633D.5070905@cesca.es> Message-ID: Jordi Prats wrote: > You could use tripwire to check periodically all files instead of relay > on the file system for that task. (I think no file system does this > checking by now) That's a possible idea. I would have thought it would be relatively simple to write a block device which acted as a layer between the file system and the real block device. I suppose the difficulty is getting all the corner cases correct. I've never written any kernel code, so maybe I should investigate doing that for fun... Jeremy -- Jeremy Sanders http://www-xray.ast.cam.ac.uk/~jss/ X-Ray Group, Institute of Astronomy, University of Cambridge, UK. Public Key Server PGP Key ID: E1AAE053 From tweeks at rackspace.com Fri Jan 11 19:55:46 2008 From: tweeks at rackspace.com (tweeks) Date: Fri, 11 Jan 2008 13:55:46 -0600 Subject: Checksumming layer In-Reply-To: References: <4787633D.5070905@cesca.es> Message-ID: <200801111355.47649.tweeks@rackspace.com> On Friday 11 January 2008 06:44, Jeremy Sanders wrote: > Jordi Prats wrote: > > You could use tripwire to check periodically all files instead of relay > > on the file system for that task. (I think no file system does this > > checking by now) > > That's a possible idea. > > I would have thought it would be relatively simple to write a block device > which acted as a layer between the file system and the real block device. I > suppose the difficulty is getting all the corner cases correct. I've never > written any kernel code, so maybe I should investigate doing that for > fun... All files in the system are already hashed. You can see this by doing an "rpm -Va". For example.. to create a baseline of a system to compare against, just cron a script to: rpm -Va > /root/RPMV/system-rpm-baseline.txt then once/day or whatever, do a diff...
or just grep for any "bin" directory changes and diff that. I like this better than messing with tripwire. It's already there, native, and easy to use. Tweeks Confidentiality Notice: This e-mail message (including any attached or embedded documents) is intended for the exclusive and confidential use of the individual or entity to which this message is addressed, and unless otherwise expressly indicated, is confidential and privileged information of Rackspace Managed Hosting. Any dissemination, distribution or copying of the enclosed material is prohibited. If you receive this transmission in error, please notify us immediately by e-mail at abuse at rackspace.com, and delete the original message. Your cooperation is appreciated. From forest at alittletooquiet.net Fri Jan 11 20:13:11 2008 From: forest at alittletooquiet.net (Forest Bond) Date: Fri, 11 Jan 2008 15:13:11 -0500 Subject: Checksumming layer In-Reply-To: <200801111355.47649.tweeks@rackspace.com> References: <4787633D.5070905@cesca.es> <200801111355.47649.tweeks@rackspace.com> Message-ID: <20080111201311.GC21140@storm.local.network> Hi, On Fri, Jan 11, 2008 at 01:55:46PM -0600, tweeks wrote: > On Friday 11 January 2008 06:44, Jeremy Sanders wrote: > > Jordi Prats wrote: > > > You could use tripwire to check periodically all files instead of relay > > > on the file system for that task. (I think no file system does this > > > checking by now) > > > > That's a possible idea. > > > > I would have thought it would be relatively simple to write a block device > > which acted a layer between the file system and real block device. I > > suppose the difficultly is getting all the corner cases correct. I've never > > written any kernel code, so maybe I should investigate doing that for > > fun... > > All files in the system are already hashed. You can see this by doing > an "rpm -Va". For example.. 
to create a baseline of a system to compare > against, just cron a script to: > rpm -Va > /root/RPMV/system-rpm-baseline.txt > > then once/day or whatever, do a diff... or just grep for any "bin" directory > changes and diff that. I like this better than messing with tripwire. It's > already there, native, and easy to use. This is specific to: * RPM-based systems * files provided by RPMs Consequently, it's only useful on certain systems, and, even then, only with certain files. That's not very good coverage, is it? This is especially true when you consider that the files that came from the package manager are usually the ones that you don't care about as much when you've lost data. -Forest -- Forest Bond http://www.alittletooquiet.net -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: Digital signature URL: From adilger at sun.com Fri Jan 11 22:09:20 2008 From: adilger at sun.com (Andreas Dilger) Date: Fri, 11 Jan 2008 15:09:20 -0700 Subject: Checksumming layer In-Reply-To: References: <4787633D.5070905@cesca.es> Message-ID: <20080111220920.GU3351@webber.adilger.int> On Jan 11, 2008 12:44 +0000, Jeremy Sanders wrote: > I would have thought it would be relatively simple to write a block device > which acted a layer between the file system and real block device. I > suppose the difficultly is getting all the corner cases correct. I've never > written any kernel code, so maybe I should investigate doing that for > fun... I think at one point there was a checksumming loop driver, and adding a checksumming mechanism to DM wouldn't be so hard either. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. 
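The cron'd "rpm -Va" baseline check that tweeks describes above might look roughly like the following. The storage location, the grep pattern for "bin" directories, and the behaviour on non-RPM systems are illustrative assumptions, not anything posted in the thread.

```shell
#!/bin/sh
# Keep a baseline of "rpm -Va" output and diff new runs against it,
# flagging changes under bin/sbin/lib paths. Paths are illustrative.
BASE=${BASE:-$(mktemp -d)}
BASELINE="$BASE/system-rpm-baseline.txt"

if ! command -v rpm >/dev/null 2>&1; then
    echo "rpm not available; nothing to verify"
else
    # rpm -Va exits non-zero whenever any file fails verification
    rpm -Va > "$BASE/current.txt" 2>/dev/null || true
    if [ -f "$BASELINE" ]; then
        diff "$BASELINE" "$BASE/current.txt" | grep -E '/(s?bin|lib)/' \
            || echo "no new system-file changes"
    else
        mv "$BASE/current.txt" "$BASELINE"
        echo "baseline created"
    fi
fi
```

As Forest points out, this only covers files that came from packages, so it is a compromise-detection aid rather than a data-integrity check.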
From tweeks at rackspace.com Fri Jan 11 22:52:36 2008 From: tweeks at rackspace.com (tweeks) Date: Fri, 11 Jan 2008 16:52:36 -0600 Subject: Checksumming layer In-Reply-To: <20080111201311.GC21140@storm.local.network> References: <200801111355.47649.tweeks@rackspace.com> <20080111201311.GC21140@storm.local.network> Message-ID: <200801111652.37607.tweeks@rackspace.com> On Friday 11 January 2008 14:13, Forest Bond wrote: > Hi, > > On Fri, Jan 11, 2008 at 01:55:46PM -0600, tweeks wrote: > > On Friday 11 January 2008 06:44, Jeremy Sanders wrote: > > > Jordi Prats wrote: > > > > You could use tripwire to check periodically all files instead of > > > > relay on the file system for that task. (I think no file system does > > > > this checking by now) > > > > > > That's a possible idea. > > > > > > I would have thought it would be relatively simple to write a block > > > device which acted a layer between the file system and real block > > > device. I suppose the difficultly is getting all the corner cases > > > correct. I've never written any kernel code, so maybe I should > > > investigate doing that for fun... > > > > All files in the system are already hashed. You can see this by doing > > an "rpm -Va". For example.. to create a baseline of a system to compare > > against, just cron a script to: > > rpm -Va > /root/RPMV/system-rpm-baseline.txt > > > > then once/day or whatever, do a diff... or just grep for any "bin" > > directory changes and diff that. I like this better than messing with > > tripwire. It's already there, native, and easy to use. > > This is specific to: > > * RPM-based systems > * files provided by RPMs > Consequently, it's only useful on certain systems, Heh.. well.. last I checked, this is a redhat ext3 list. Red hat uses rpm.. and no one but Red hat still actually uses ext3 right? (hehe)... > and, even then, only > with certain files. That's not very good coverage, is it? Uhh.. all SYSTEM files.. 
which is all I'm looking at when doing compromise checks (except for root kits, etc.. for which I use separate tools). > This is especially true when you consider that the files that came from the > package manager are usually the ones that you don't care about as much when > you've lost data. You tripwire-scan data files? Hmm.. I've seen hundreds of compromised servers... 80-90% of them can be detected with a simple RPM scan. The ones you can't are the ones where hackers have deleted the RPM DBs, but in that case, your baseline diff sets off red flags anyway. It's actually a pretty good scan to run nightly/weekly, etc (along with root kit scans, etc). In fact.. I prefer using unorthodox detection methods rather than well-known forms of F.A.M. (file alteration monitoring) like tripwire, which, if seen, are instantly attacked and disabled. Tweeks From jack at suse.cz Mon Jan 14 17:06:09 2008 From: jack at suse.cz (Jan Kara) Date: Mon, 14 Jan 2008 18:06:09 +0100 Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66) In-Reply-To: <477BF72B.4000608@oracle.com> References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> <477BF72B.4000608@oracle.com> Message-ID: <20080114170609.GH4214@duck.suse.cz> On Wed 02-01-08 12:42:19, Zach Brown wrote: > Erez Zadok wrote: > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree.
> > Kernel w/ SMP, preemption, and lockdep configured. > > This is a real lock ordering problem. Thanks for reporting it. > > The updating of atime inside sys_mmap() orders the mmap_sem in the vfs > outside of the journal handle in ext3's inode dirtying: > > > -> #1 (jbd_handle){--..}: > > [] __lock_acquire+0x9cc/0xb95 > > [] lock_acquire+0x5f/0x78 > > [] journal_start+0xee/0xf8 > > [] ext3_journal_start_sb+0x48/0x4a > > [] ext3_dirty_inode+0x27/0x6c > > [] __mark_inode_dirty+0x29/0x144 > > [] touch_atime+0xb7/0xbc > > [] generic_file_mmap+0x2d/0x42 > > [] mmap_region+0x1e6/0x3b4 > > [] do_mmap_pgoff+0x1fb/0x253 > > [] sys_mmap2+0x9b/0xb5 > > [] syscall_call+0x7/0xb > > [] 0xffffffff > > ext3_direct_IO() orders the journal handle outside of the mmap_sem that > dio_get_page() acquires to pin pages with get_user_pages(): > > > -> #0 (&mm->mmap_sem){----}: > > [] __lock_acquire+0x8bc/0xb95 > > [] lock_acquire+0x5f/0x78 > > [] down_read+0x3a/0x4c > > [] dio_get_page+0x4e/0x15d > > [] __blockdev_direct_IO+0x431/0xa81 > > [] ext3_direct_IO+0x10c/0x1a1 > > [] generic_file_direct_IO+0x124/0x139 > > [] generic_file_direct_write+0x56/0x11c > > [] __generic_file_aio_write_nolock+0x33d/0x489 > > [] generic_file_aio_write+0x58/0xb6 > > [] ext3_file_write+0x27/0x99 > > [] do_sync_write+0xc5/0x102 > > [] vfs_write+0x90/0x119 > > [] sys_write+0x3d/0x61 > > [] sysenter_past_esp+0x5f/0xa5 > > [] 0xffffffff > > Two fixes come to mind: > > 1) use something like Peter's ->mmap_prepare() to update atime before > acquiring the mmap_sem. ( http://lkml.org/lkml/2007/11/11/97 ). I > don't know if this would leave more paths which do a journal_start() > while holding the mmap_sem. > > 2) rework ext3's dio to only hold the jbd handle in ext3_get_block(). > Chris has a patch for this kicking around somewhere but I'm told it has > problems exposing old blocks in ordered data mode. > > Does anyone have preferences? I could go either way. 
I certainly don't > like the idea of journal handles being held across the entirety of > fs/direct-io.c. It's yet another case of O_DIRECT differing wildly from > the buffered path :(. I've looked more into it and I think that 2) is the only way to go since transaction start ranks below page lock (standard buffered write path) and page lock ranks below mmap_sem. So we have at least one more dependency: mmap_sem must go before transaction start... Honza -- Jan Kara SUSE Labs, CR From giancarlo.corti at supsi.ch Tue Jan 22 16:01:50 2008 From: giancarlo.corti at supsi.ch (giancarlo corti) Date: Tue, 22 Jan 2008 17:01:50 +0100 Subject: forced fsck (again?) Message-ID: <200801221701.50202.giancarlo.corti@supsi.ch> hello everyone. i guess this has been asked before, but haven't found it in the faq. i have the following issue... it is not uncommon nowadays to have desktops with filesystems in the order of 500gb/1tb. now, my kubuntu (but other distros do the same) forces a fsck on ext3 every so often, no matter what. in the past it wasn't a big issue. but with sizes increasing so much, users are now forced to wait for several minutes (every so often) for their desktops to boot up. to the point that the thing has become unacceptable. i know i can tune/disable this, but i'd like to understand once and for all what is the technical rationale behind this practice and what use is there to force a fsck on a clean fs... i must be missing something... :-( thanks in advance. cheers. From lm at bitmover.com Tue Jan 22 16:08:59 2008 From: lm at bitmover.com (Larry McVoy) Date: Tue, 22 Jan 2008 08:08:59 -0800 Subject: forced fsck (again?)
In-Reply-To: <200801221701.50202.giancarlo.corti@supsi.ch> References: <200801221701.50202.giancarlo.corti@supsi.ch> Message-ID: <20080122160859.GA25057@bitmover.com> > i know i can tune/disable this, but i'd like to understand once > and for all what is the technical rationale behind this practice > and what use is there to force a fsck on a clean fs... Disks rot. -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com From sandeen at redhat.com Tue Jan 22 16:10:38 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 22 Jan 2008 10:10:38 -0600 Subject: forced fsck (again?) In-Reply-To: <200801221701.50202.giancarlo.corti@supsi.ch> References: <200801221701.50202.giancarlo.corti@supsi.ch> Message-ID: <4796157E.5040803@redhat.com> giancarlo corti wrote: > hello everyone. > > i guess this has been asked before, but haven't found it in the faq. > > i have the following issue... > > it is not uncommon nowadays to have desktops with filesystems > in the order of 500gb/1tb. > > now, my kubuntu (but other distros do the same) forces a fsck > on ext3 every so often, no matter what. Did you just update to e2fsprogs-1.40.3? If so, should be fixed in 1.40.4 for the most part. See Debian bug 454926, http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=454926 -Eric From sandeen at redhat.com Tue Jan 22 16:11:54 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 22 Jan 2008 10:11:54 -0600 Subject: forced fsck (again?) In-Reply-To: <4796157E.5040803@redhat.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> Message-ID: <479615CA.1090408@redhat.com> Eric Sandeen wrote: > giancarlo corti wrote: >> hello everyone. >> >> i guess this has been asked before, but haven't found it in the faq. >> >> i have the following issue... >> >> it is not uncommon nowadays to have desktops with filesystems >> in the order of 500gb/1tb. 
>> >> now, my kubuntu (but other distros do the same) forces a fsck >> on ext3 every so often, no matter what. > > Did you just update to e2fsprogs-1.40.3? > > If so, should be fixed in 1.40.4 for the most part. > > See Debian bug 454926, > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=454926 Oh - or, if it's not running each time, and it's just a general question about periodic fscks, then yeah, what Larry said, I guess. Although not all filesystems do this. -Eric From val.henson at gmail.com Tue Jan 22 22:34:35 2008 From: val.henson at gmail.com (Valerie Henson) Date: Tue, 22 Jan 2008 14:34:35 -0800 Subject: forced fsck (again?) In-Reply-To: <479615CA.1090408@redhat.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> Message-ID: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> On Jan 22, 2008 8:11 AM, Eric Sandeen wrote: > Eric Sandeen wrote: > > giancarlo corti wrote: > >> hello everyone. > >> > >> i guess this has been asked before, but haven't found it in the faq. > >> > >> i have the following issue... > >> > >> it is not uncommon nowadays to have desktops with filesystems > >> in the order of 500gb/1tb. > >> > >> now, my kubuntu (but other distros do the same) forces a fsck > >> on ext3 every so often, no matter what. > > > > Did you just update to e2fsprogs-1.40.3? > > > > If so, should be fixed in 1.40.4 for the most part. > > > > See Debian bug 454926, > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=454926 > > Oh - or, if it's not running each time, and it's just a general question > about periodic fscks, then yeah, what Larry said, I guess. > > Although not all filesystems do this. This will be ironic coming from me, but I think the ext3 defaults for forcing a file system check are a little too conservative for many modern use cases. The two cases I have in mind in particular are: * Servers with long uptimes that need very low data unavailability times. 
Imagine you have a machine room full of servers that have all been up and running happily for more than 180 days - the preferred case. Now imagine that the room overheats and the emergency power cut is tripped. Standard heat reduction is swiftly applied (i.e., open the door and turn on a fan and hope security doesn't notice) and the power turned back on. Now your entire machine room will be fscking for the next 3 hours and whatever service they provide will be completely unavailable. Of course, any admin worth their salt will turn off force fsck so it only runs during controlled downtime... won't they? * Laptops. If suspend and resume doesn't work on your laptop, you'll be rebooting (and remounting) a lot, perhaps several times a day. The preferred solution is to get Matthew Garrett to fix your laptop, but if you can't, fscking every 10-30 days seems a little excessive. Desktop users who shutdown daily to save power will have similar problems. Distros often have the "don't fsck on battery" option and some don't use the ext3 defaults for mkfs, but that's only a partial solution. In this case, it's definitely a little much to ask a random laptop user to tune their file system. I'm not sure what the best solution is - print warnings for several days/mounts before the force fsck? print warnings but don't force fsck? increase the default days/mounts before force fsck? base force fsck intervals on write activity? - but in practice I find myself telling people about "tune2fs -c 0 -i 0" a lot. I use it on all my file systems and run fsck by hand every few months (or more often when I'm working on fsck :) ). Disks do rot, and file systems do get corrupted, and fsck should be run periodically, but the current system of frequent unpredictable forced fsck at boot is probably not the best cost/benefit tradeoff for many use cases. 
-VAL From tytso at MIT.EDU Tue Jan 22 22:52:48 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Tue, 22 Jan 2008 17:52:48 -0500 Subject: forced fsck (again?) In-Reply-To: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> Message-ID: <20080122225248.GD1659@mit.edu> On Tue, Jan 22, 2008 at 02:34:35PM -0800, Valerie Henson wrote: > This will be ironic coming from me, but I think the ext3 defaults for > forcing a file system check are a little too conservative for many > modern use cases. The two cases I have in mind in particular are: Yeah. To the extent that people are using devicemapper/LVM everywhere, there is a much better solution. To wit:

#!/bin/sh
#
# e2croncheck
VG=closure
VOLUME=root
SNAPSIZE=100m
EMAIL=tytso at mit.edu
TMPFILE=`mktemp -t e2fsck.log.XXXXXXXXXX`
set -e
START="$(date +'%Y%m%d%H%M%S')"
lvcreate -s -L ${SNAPSIZE} -n "${VOLUME}-snap" "${VG}/${VOLUME}"
if nice logsave -as $TMPFILE e2fsck -p -C 0 "/dev/${VG}/${VOLUME}-snap" && \
   nice logsave -as $TMPFILE e2fsck -fy -C 0 "/dev/${VG}/${VOLUME}-snap" ; then
  echo 'Background scrubbing succeeded!'
  tune2fs -C 0 -T "${START}" "/dev/${VG}/${VOLUME}"
else
  echo 'Background scrubbing failed! Reboot to fsck soon!'
  tune2fs -C 16000 -T "19000101" "/dev/${VG}/${VOLUME}"
  if test -n "$EMAIL"; then
    mail -s "E2fsck of /dev/${VG}/${VOLUME} failed!" $EMAIL < $TMPFILE
  fi
fi
lvremove -f "${VG}/${VOLUME}-snap"
rm $TMPFILE

> * Servers with long uptimes that need very low data unavailability > times. Imagine you have a machine room full of servers that have all > been up and running happily for more than 180 days - the preferred > case. And the server should be checking the filesystem every month or so. But with the long, extended uptime, it doesn't happen. Using LVM and the above script solves that problem. > * Laptops.
> If suspend and resume doesn't work on your laptop, you'll
> be rebooting (and remounting) a lot, perhaps several times a day. The
> preferred solution is to get Matthew Garrett to fix your laptop, but
> if you can't, fscking every 10-30 days seems a little excessive.

It's sad that it's so hard to get suspend/resume working. But yeah, it's either Matthew or someone like Nigel from the TuxOnIce lists to help you, or maybe a few other people. Checking from cron is, I believe, the right answer here, too, as long as there is a check to make sure you're running on AC before doing the check.

So ---- for someone who has time, I offer the following challenge. Take the above script, and enhance it in the following ways:

* Read a configuration file to see which filesystem(s) to check and to which e-mail the error reports should be sent.

* Have the script abort the check if the system appears to be running off of a battery.

* Have the config file define a time period (say, 30 days), and have the script test to see if the last_mount time is greater than the time interval. If it is, then it does the check, otherwise it skips it.

With these enhancements, in the laptop case the script could be fired off by cron every night at 3am, and if a month has gone by without a check, AND the laptop is running off the AC mains, the check happens automatically, in the background.

> I'm not sure what the best solution is - print warnings for several
> days/mounts before the force fsck? print warnings but don't force
> fsck? increase the default days/mounts before force fsck? base force
> fsck intervals on write activity? - but in practice I find myself
> telling people about "tune2fs -c 0 -i 0" a lot. I use it on all my
> file systems and run fsck by hand every few months (or more often when
> I'm working on fsck :) ).

Well, this isn't a complete solution, because a lot of people don't use LVM, often because they don't trust initrd's to do the right thing --- and quite frankly, I can't blame them.
But doing this kind of thing is so much better that maybe it would actually help convert more kernel developers to use LVM on their boot filesystem. (Well, probably not. That's probably being too optimistic. :-) - Ted From bryan at kadzban.is-a-geek.net Wed Jan 23 01:50:33 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Tue, 22 Jan 2008 20:50:33 -0500 Subject: forced fsck (again?) In-Reply-To: <20080122225248.GD1659@mit.edu> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> Message-ID: <47969D69.4060607@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Theodore Tso wrote: > So ---- for someone who has time, I offer the following challenge. > Take the above script, and enhance it in the following ways: > > * Read a configuration file to see which filesystem(s) to > check and to which e-mail the error reports should be sent. Add support for checking multiple FSes, too. :-) > * Have the script abort the check if the system appears to be > running off of a battery. Sort of. Much of this on_ac_power function was stolen from Debian's powermgmt_base package's on_ac_power script, but it doesn't support anything other than ACPI. (It checks the new sysfs power_supply class first, and the /proc/acpi/ac_adapter/ directory second.) If the function can't determine if AC power is available, the script assumes it's on battery, and exits; this is suboptimal for desktops, but good for laptops that don't have ACPI turned on for whatever reason. > * Have the config file define a time period (say, 30 days), > and have the script test to see if the last_mount time is > greater than the time interval. If it is, then it does the > check, otherwise it skips it. Well, this script looks at the last-check time, not the last-mount time. But close enough. 
> With these enhancements, in the laptop case the script could be fired > off by cron every night at 3am, and if a month has gone by without a > check, AND the laptop is running off the AC mains, the check happens > automatically, in the background. See the attached script (e2check) and sample config file (e2check.conf). :-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHlp1nS5vET1Wea5wRA2XkAKC9vPadZzYxbBITFVkSUAntYGOk4QCg4+SZ QK+2xfdB7wtVF/J152S/P2s= =lhcS -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check.conf URL: From tytso at MIT.EDU Wed Jan 23 03:10:12 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Tue, 22 Jan 2008 22:10:12 -0500 Subject: forced fsck (again?) In-Reply-To: <47969D69.4060607@kadzban.is-a-geek.net> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> Message-ID: <20080123031012.GD1320@mit.edu> On Tue, Jan 22, 2008 at 08:50:33PM -0500, Bryan Kadzban wrote: > See the attached script (e2check) and sample config file (e2check.conf). > :-) Just a few requests. First of all, can you send me a Signed-Off-By, so I can include it in future versions of e2fsprogs. See the SUBMITTING-PATCHES in the top-level of the e2fsprogs source tree. (It means the same thing as the Linux Kernel). > > * Have the script abort the check if the system appears to be > > running off of a battery. > > Sort of. Much of this on_ac_power function was stolen from Debian's > powermgmt_base package's on_ac_power script, but it doesn't support > anything other than ACPI. 
(It checks the new sysfs power_supply class > first, and the /proc/acpi/ac_adapter/ directory second.) > > If the function can't determine if AC power is available, the script > assumes it's on battery, and exits; this is suboptimal for desktops, but > good for laptops that don't have ACPI turned on for whatever reason. Yeah, the default needs to be the other way around for servers, which may not have the ac_adapter interface at all. > > * Have the config file define a time period (say, 30 days), > > and have the script test to see if the last_mount time is > > greater than the time interval. If it is, then it does the > > check, otherwise it skips it. > > Well, this script looks at the last-check time, not the last-mount time. > But close enough. Yeah, that's what I wanted. > See the attached script (e2check) and sample config file (e2check.conf). > :-) Hmm, if you're going to source the config file directly, why not do this instead: check_lvm_fs closure root 100m 30 check_lvm_fs closure home 100m 30 instead of this: > VGS=(closure closure) > VOLUMES=(root home) > SNAPSIZES=(100m 100m) > INTERVALS=(30 30) If you have six or eight volumes to check, keeping them lined up could be error-prone. Thanks for stepping up! - Ted From bryan at kadzban.is-a-geek.net Wed Jan 23 03:35:43 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Tue, 22 Jan 2008 22:35:43 -0500 Subject: forced fsck (again?) In-Reply-To: <20080123031012.GD1320@mit.edu> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> Message-ID: <4796B60F.4040009@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Theodore Tso wrote: > First of all, can you send me a Signed-Off-By, Sure, I'll do that on the version that has other changes below. 
I realized after sending the first version that it probably also needed to be GPLed, as technically, the attachment before wasn't distributable at all. I meant it to be GPL, since it used bits of the powermgmt-base package from Debian. I assume that's the license implied, since I'm submitting it to be included in e2fsprogs? Or does GPLv2 need to be mentioned in the files as well? (Actually, what about GPL versions? It looks like e2fsprogs is still at GPL version 2 -- that's OK with me, but do I need to say "v2 only", "v2 or later", or nothing specific?) > Yeah, the default needs to be the other way around for servers, which > may not have the ac_adapter interface at all. Sounds like another config file setting, then. I can probably simplify the interface a bit if the decision is made by a config file setting, too. (I can make the function return 0 or 1 based on the config file if it falls through all the checks that are there, instead of returning 255 and making the caller handle it differently.) > Hmm, if you're going to source the config file directly, why not do > this instead: > > check_lvm_fs closure root 100m 30 > check_lvm_fs closure home 100m 30 Are you thinking that the check_lvm_fs calls would be in the config file (after setting global options), and the check_lvm_fs function would be defined in the main script? That's my guess here, and it'd probably work OK, but it'd take a bit of work. And it's getting late here, so I probably won't get it changed until at least tomorrow night. > If you have six or eight volumes to check, keeping them lined up > could be error-prone. That's true. I was looking for a way to do named array indices, like a hashtable in most other languages, but it doesn't look like bash has that ability. Plain old Bourne sh almost certainly doesn't. Anyway, calling the check_lvm_fs function from the config file is a little bit backwards, but would certainly work better than a bunch of arrays that all have to be in sync. 
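Something like this is the shape I have in mind -- the function lives in the main script, and the config file just calls it once per volume (a stubbed-out sketch; the real body would do the snapshot-and-fsck work):

```shell
# Main script defines check_lvm_fs, then sources the config file.
# This stub only records each request so the calling convention is visible.
CHECKS=""
check_lvm_fs() {
    vg=$1 lv=$2 snapsize=$3 interval_days=$4
    CHECKS="${CHECKS:+$CHECKS }$vg/$lv:$snapsize:$interval_days"
}

# ". /etc/e2check.conf" would go here; simulate its contents:
check_lvm_fs closure root 100m 30
check_lvm_fs closure home 100m 30

echo "$CHECKS"   # prints: closure/root:100m:30 closure/home:100m:30
```

Keeping the four arguments together per call is exactly what avoids the lined-up-arrays problem.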
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHlrYOS5vET1Wea5wRA/ZyAJ9zpvPzCFUl28KuNDJfv+G19yt2IwCgmB5p D9SIDhoJ3eF7khgXgb0WSXY= =ea4T -----END PGP SIGNATURE----- From darkonc at gmail.com Wed Jan 23 04:10:25 2008 From: darkonc at gmail.com (Stephen Samuel) Date: Tue, 22 Jan 2008 20:10:25 -0800 Subject: forced fsck (again?) In-Reply-To: <4796B60F.4040009@kadzban.is-a-geek.net> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> Message-ID: <6cd50f9f0801222010r374dfe5lcf7bc24b5d2ad82d@mail.gmail.com> On Jan 22, 2008 7:35 PM, Bryan Kadzban wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: RIPEMD160 > > (Actually, what about GPL versions? It looks like e2fsprogs is still at > GPL version 2 -- that's OK with me, but do I need to say "v2 only", "v2 > or later", or nothing specific?) > My suggestion is 'V2 or later', since that covers V2, V3 (and, eventually, Vs 4, 5, 6 and 7). -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From adilger at sun.com Wed Jan 23 08:15:48 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 23 Jan 2008 01:15:48 -0700 Subject: forced fsck (again?) 
In-Reply-To: <4796B60F.4040009@kadzban.is-a-geek.net> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> Message-ID: <20080123081548.GY3180@webber.adilger.int> On Jan 22, 2008 22:35 -0500, Bryan Kadzban wrote: > > Hmm, if you're going to source the config file directly, why not do > > this instead: > > > > check_lvm_fs closure root 100m 30 > > check_lvm_fs closure home 100m 30 > > Are you thinking that the check_lvm_fs calls would be in the config file > (after setting global options), and the check_lvm_fs function would be > defined in the main script? That's my guess here, and it'd probably > work OK, but it'd take a bit of work. And it's getting late here, so I > probably won't get it changed until at least tomorrow night. It probably makes more sense just to parse /etc/fstab and check the filesystems that have PASS != 0 (column 6), since those are the filesystems that will be automatically checked on the next boot. This also avoids more configuration by the user, which is always desirable. The second benefit of parsing /etc/fstab is that the filesystem type can be checked and "fsck.{fstype}" used (if available) instead of just "e2fsck". Alternately, using "lvscan" to check for mounted LVM filesystems and their fstype is another option, since there is no guarantee that all filesystems listed in /etc/fstab are on LVM. 
That's what I did in a very old, but similar, script: http://osdir.com/ml/linux.lvm.devel/2003-04/msg00001.html The only unfortunate thing is that I was revalidating this script still works with LVM2 on my system, and created an LV snapshot (worked OK), but when I tried to lvremove it immediately thereafter the system went into 100% IO wait and the lvremove process was unkillable :-(. This was the 2.6.16 SLES10 kernel, so that may have been fixed in the meantime... The LVM functions used in this script still appear to be working with LVM2, so I think it is still a valid approach. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Wed Jan 23 09:16:01 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 23 Jan 2008 02:16:01 -0700 Subject: forced fsck (again?) In-Reply-To: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> Message-ID: <20080123091601.GZ3180@webber.adilger.int> On Jan 22, 2008 14:34 -0800, Valerie Henson wrote: > I'm not sure what the best solution is - print warnings for several > days/mounts before the force fsck? print warnings but don't force > fsck? increase the default days/mounts before force fsck? I believe current e2fsprogs already prints the number of mounts remaining before e2fsck is forced, though this doesn't help for time-based checks with a long system uptime. Conversely, I think for users that have set "-c 0 -i 0" e2fsck should print a message like "fs mounted 50 times, last e2fsck was 200 days ago" or similar, if the default limits are exceeded to alert the user that this might be an issue. > base force fsck intervals on write activity? 
I had submitted a patch ages ago that considered "clean" unmounts less dangerous than "crash" and only incremented the mount count about 1/5 times in that case (randomly). > Disks do rot, and file systems do get corrupted, and fsck should be > run periodically, but the current system of frequent unpredictable > forced fsck at boot is probably not the best cost/benefit tradeoff for > many use cases. Maybe some of the distro folks (Eric? :-) will pick up on this thread and consider adding the "e2fsck snapshot" script to cron.monthly or similar. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From sandeen at redhat.com Wed Jan 23 14:05:21 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 23 Jan 2008 08:05:21 -0600 Subject: forced fsck (again?) In-Reply-To: <20080123091601.GZ3180@webber.adilger.int> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080123091601.GZ3180@webber.adilger.int> Message-ID: <479749A1.5040208@redhat.com> Andreas Dilger wrote: > Maybe some of the distro folks (Eric? :-) will pick up on this thread and > consider adding the "e2fsck snapshot" script to cron.monthly or similar. I'm watching.... sure, that might be a candidate for Fedora. Ideally it'd be part of e2fsprogs, so we're not carrying/maintaining stuff that's not upstream. But Fedora does install onto lvm by default, so it sounds like a good candidate for fedora. -Eric > Cheers, Andreas From tytso at MIT.EDU Wed Jan 23 14:08:47 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Wed, 23 Jan 2008 09:08:47 -0500 Subject: forced fsck (again?) 
In-Reply-To: <20080123081548.GY3180@webber.adilger.int>
References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int>
Message-ID: <20080123140847.GB29321@mit.edu>

On Wed, Jan 23, 2008 at 01:15:48AM -0700, Andreas Dilger wrote:
> It probably makes more sense just to parse /etc/fstab and check the
> filesystems that have PASS != 0 (column 6), since those are the
> filesystems that will be automatically checked on the next boot. This
> also avoids more configuration by the user, which is always desirable.

I thought of that, but given that you need to configure the e-mail to send reports, and the snapshot size, we need another configuration file anyway. (We could sneak some of that information into the options field of fstab, since the kernel and other programs that parse that field just take what they need and ignore the rest, but.... ick, ick, ick. :-)

Also, I could imagine that a user might not want to check all of the filesystems in fstab.

> Alternately, using "lvscan" to check for mounted LVM filesystems and
> their fstype is another option, since there is no guarantee that all
> filesystems listed in /etc/fstab are on LVM. That's what I did in a
> very old, but similar, script:
>
> http://osdir.com/ml/linux.lvm.devel/2003-04/msg00001.html

I do like the fact that your script does much better error checking than mine. :-)

- Ted

From adilger at sun.com Wed Jan 23 19:23:34 2008
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 23 Jan 2008 12:23:34 -0700
Subject: forced fsck (again?)
In-Reply-To: <20080123140847.GB29321@mit.edu> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> Message-ID: <20080123192334.GG3180@webber.adilger.int> On Jan 23, 2008 09:08 -0500, Theodore Tso wrote: > I thought of that, but given that you need to configure the e-mail to > send reports, and the snapshot size, we need another configuration > file anyway. (We could sneek some of that information into the > options field of fstab, since the kernel and other programs that parse > that field just take what they need and ignore the rest, but.... ick, > ick, ick. :-) I agree - adding email to fstab is icky and I wouldn't go there. I don't see a problem with just emailing it to "root@" by default and giving the user the option to change it to something else. > Also, I could imagine that a user might not want to check all of the > filesystems in fstab. Similarly, a config file which disables checking on some LV if specified seems reasonable. IMHO the main goal is to make things transparent to the user and avoid their annoyance of "e2fsck at boot". Since the e2fsck is on a read-only LV snapshot, there shouldn't be any danger to the filesystems. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From bryan at kadzban.is-a-geek.net Thu Jan 24 02:10:31 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Wed, 23 Jan 2008 21:10:31 -0500 Subject: forced fsck (again?) 
In-Reply-To: <20080123192334.GG3180@webber.adilger.int> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> Message-ID: <4797F397.9020306@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 23, 2008 09:08 -0500, Theodore Tso wrote: >> (We could sneek some of that information into the options field of >> fstab, since the kernel and other programs that parse that field >> just take what they need and ignore the rest, but.... ick, ick, >> ick. :-) > > I agree - adding email to fstab is icky and I wouldn't go there. I > don't see a problem with just emailing it to "root@" by default and > giving the user the option to change it to something else. Since the email address is not per-filesystem, it's fine by me to put it into a config file somewhere. Forcing the interval to be global is probably also OK, although I wouldn't want to be forced to set the snapshot size globally. I do think that fstab is the best place for per-filesystem options, though. But it's not too difficult to parse out a custom SNAPSIZE option, and even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is present on any LV, if the script is going to parse fstab anyway. (Or should the option's name be lowercase? Either will work.) >> Also, I could imagine that a user might not want to check all of >> the filesystems in fstab. > > Similarly, a config file which disables checking on some LV if > specified seems reasonable. That does seem reasonable, but I haven't done it in the script that's attached. 
Maybe support for a SKIP (or skip, or e2check_skip, or skip_e2check, or whatever) option in fstab's options field? Regarding the idea of having this support multiple filesystems -- that's a good idea, I think, but the current script is highly specific to ext2 or ext3. Use of tune2fs (to reset the last-check time) and dumpe2fs (to find the last-check time), in particular, will be problematic on other FSes. I haven't done that in this script, though it may be possible. Anyway, here's a second version. I've changed it to parse up fstab, and added an option for what to do if AC status can't be determined. Kernel-style changelog entry, etc., below: - ------- Create a script to transparently run e2fsck in the background on any LVM logical volumes listed in /etc/fstab, as long as the machine is on AC power, and that LV has been last checked more than a configurable number of days ago. Also create a configuration file to set various options in the script. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHl/OXS5vET1Wea5wRA/UaAJwIE27W6qasI7Gm/uvZm/pY1rcBtwCcDXYq cc3qE/uOEqm4ksYHlI6+IJU= =7Lf3 -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check.conf URL: From adilger at sun.com Thu Jan 24 04:39:30 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 23 Jan 2008 21:39:30 -0700 Subject: forced fsck (again?) 
In-Reply-To: <4797F397.9020306@kadzban.is-a-geek.net> References: <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> Message-ID: <20080124043930.GG18433@webber.adilger.int> On Jan 23, 2008 21:10 -0500, Bryan Kadzban wrote: > Since the email address is not per-filesystem, it's fine by me to put it > into a config file somewhere. Forcing the interval to be global is > probably also OK, although I wouldn't want to be forced to set the > snapshot size globally. I do think that fstab is the best place for > per-filesystem options, though. > > But it's not too difficult to parse out a custom SNAPSIZE option, and > even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is > present on any LV, if the script is going to parse fstab anyway. (Or > should the option's name be lowercase? Either will work.) The problem with this is that ext2/3/4, along with most other filesystems will fail to mount if passed an unknown mount option. > Regarding the idea of having this support multiple filesystems -- that's > a good idea, I think, but the current script is highly specific to ext2 > or ext3. Use of tune2fs (to reset the last-check time) and dumpe2fs (to > find the last-check time), in particular, will be problematic on other > FSes. I haven't done that in this script, though it may be possible. Well, my equivalent script just checks for fsck.${fstype} and runs that on the snapshot, if available. Even if tune2fs isn't there to update a "last checked" field, it is still a useful indication of the health of the filesystem for a long-running system. 
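In shell, that dispatch is little more than the following (a sketch: the run_fsck name and the choice of the -p preen flag are mine, not taken from the script above):

```shell
# Run fsck.<fstype> on a device if such a helper exists on PATH,
# otherwise skip with a note on stderr. Illustrative only; e.g.
#   run_fsck "$FSTYPE" "/dev/${VG}/${LV}-snap"
run_fsck() {
    fstype=$1 dev=$2
    helper="fsck.$fstype"
    if command -v "$helper" >/dev/null 2>&1; then
        "$helper" -p "$dev"
    else
        echo "no $helper found, skipping $dev" >&2
        return 1
    fi
}
```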
For filesystems like XFS, where fsck.xfs is (unfortunately) an empty shell that does nothing, this could be special-cased to call xfs_check.

> # parse up fstab
> grep -v '^#' /etc/fstab | grep -v '^$' | awk '$6!=0 {print $1,$3,$4;}' | \
> while read FS FSTYPE OPTIONS ; do

Urk, that is kind of ugly shell scripting... Cleaner would be:

cat /etc/fstab | while read FS DEV FSTYPE OPTIONS DUMP PASS; do
    case $FS in
    "") continue ;;
    \#*) continue ;;
    esac
    ...
done

But I've come to think that /etc/fstab is the wrong thing to use for input. This script is only useful for LVM volumes, so getting a list of LVs is more appropriate, I think.

> # get the volume group (or an error message)
> VG="`lvs --noheadings -o vg_name "$FS" 2>&1`"

Interesting, I wasn't aware of lvs... It looks like "lvdisplay -C".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From menscher at uiuc.edu Thu Jan 24 08:27:23 2008
From: menscher at uiuc.edu (Damian Menscher)
Date: Thu, 24 Jan 2008 00:27:23 -0800
Subject: forced fsck (again?)
In-Reply-To: <4797F397.9020306@kadzban.is-a-geek.net>
References: <200801221701.50202.giancarlo.corti@supsi.ch> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net>
Message-ID: <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com>

2008/1/23 Bryan Kadzban :
> But it's not too difficult to parse out a custom SNAPSIZE option, and
> even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is
> present on any LV, if the script is going to parse fstab anyway. (Or
> should the option's name be lowercase? Either will work.)

At the risk of adding complexity, what about having the SNAPSIZE be automatically determined?
Most users would have no idea what to set it to, and we should be able to guess some reasonable values. For example, the fsck time can probably be estimated by looking at the number of inodes, how full the filesystem is, etc. Alternatively, we could just allocate all available space in the LVM. I also have a newbie question: does the fsck of a snapshot really catch everything that might be wrong with the drive, or are there other failure modes that only a real fsck would catch? I'm wondering if it's still a good idea to do an occasional full fsck. Damian -- http://www.uiuc.edu/~menscher/ From bryan at kadzban.is-a-geek.net Thu Jan 24 12:19:17 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Thu, 24 Jan 2008 07:19:17 -0500 Subject: forced fsck (again?) In-Reply-To: <20080124043930.GG18433@webber.adilger.int> References: <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> Message-ID: <47988245.4010904@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 23, 2008 21:10 -0500, Bryan Kadzban wrote: >> But it's not too difficult to parse out a custom SNAPSIZE option, >> and even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE >> option is present on any LV, if the script is going to parse fstab >> anyway. (Or should the option's name be lowercase? Either will >> work.) > > The problem with this is that ext2/3/4, along with most other > filesystems will fail to mount if passed an unknown mount option. Uh oh. Yeah, that's a problem. 
I was under the impression that all the tools would ignore unknown options -- if that's not the case, then we probably need to come up with something else. Automatically determining the snapshot size sounds like a good idea, but I'm not sure how to do it. (I'm not sure what decides the snapshot size that you need -- it looks like the number of changes that you're going to make to the snapshot, or maybe the number of changes that you're going to make to both the snapshot and the real LV? In either case, I'm not sure how to find that out. Maybe just using "all available space in the VG" is a better idea anyway.) >> Regarding the idea of having this support multiple filesystems -- >> that's a good idea, I think, but the current script is highly >> specific to ext2 or ext3. Use of tune2fs (to reset the last-check >> time) and dumpe2fs (to find the last-check time), in particular, >> will be problematic on other FSes. I haven't done that in this >> script, though it may be possible. > > Well, my equivalent script just checks for fsck.${fstype} and runs > that on the snapshot, if available. Even if tune2fs isn't there to > update a "last checked" field, it is still a useful indication of the > health of the filesystem for a long-running system. True, but what about determining whether it has to run at all (based on the last-check time)? Although, I suppose it would work to leave the check interval set in the superblock, and avoid using fsck.* -f; that way each fsck would be able to determine if it should do a full check or not. Of course that means that if you can't update the last-checked time, then it'll run a check every day after the interval passes (and the machine is on AC). Of course the current script will do that too, so at least it isn't any worse there. >> grep -v '^#' /etc/fstab | grep -v '^$' | awk '$6!=0 {print $1,$3,$4;}' | \ >> while read FS FSTYPE OPTIONS ; do > > Urk, that is kind of ugly shell scripting... Yeah, no kidding. 
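For comparison, here's the pipeline-free shape I think Andreas is suggesting, fleshed out just enough to run (it reads a here-doc sample here rather than the real /etc/fstab, and the function name is made up):

```shell
# Skip blank lines, comment lines, and anything with fs_passno == 0,
# using only the shell's read builtin -- no grep/awk.
list_checked_filesystems() {
    while read -r fs mnt fstype opts dump pass; do
        case $fs in ''|\#*) continue ;; esac
        [ "${pass:-0}" -ne 0 ] || continue
        echo "$fs $fstype"
    done
}

list_checked_filesystems <<'EOF'
# /etc/fstab sample
/dev/closure/root  /      ext3  defaults  1  1
/dev/closure/home  /home  ext3  defaults  1  2
proc               /proc  proc  defaults  0  0
EOF
# prints:
# /dev/closure/root ext3
# /dev/closure/home ext3
```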
I wanted to kill lines with fs_passno set to zero, since I was already killing lines that were empty or comments. I was also afraid that sh would die if I gave "read" more variables than arguments (which is why I wanted to filter out the comments), but doing some testing shows that bash (at least) handles it OK. So maybe a normal read would work better. Or maybe rewriting in C would work; then I could just use getmntent. Although I'm not exactly a fan of writing something like this in C, either; shell is more powerful, except for this "reading fstab" thing. > But I've come to think that /etc/fstab is the wrong thing to use for > input. This script is only useful for LVM volumes, so getting a list > of LVs is more appropriate I think. True, except the no-LVs behavior of lvscan, lvs, and any of the other tools that I was looking at yesterday is decidedly non-optimal. It would probably be possible; I'll see what I can find out later today. I have a QEMU VM set up whose root FS is on LVM, on MD-raid, on DM-raid (I was testing an initramfs setup's worst-case), so it has the LVM tools and filesystems. I'll see what's available there. We'd still need to find the FS type, although I believe udev provides some programs that may be helpful (if we want to rely on them being installed). volume_id, in particular, should provide that info. >> # get the volume group (or an error message) >> VG="`lvs --noheadings -o vg_name "$FS" 2>&1`" > > Interesting, I wasn't aware of lvs... It looks like "lvdisplay -C". Sort of, although I'm not sure what -C does (it's not in my lvdisplay manpage). That manpage refers to lvs (saying "lvs provides considerably more control over the output"), and that was what I was looking for. It's fairly easy to get it to print just the VG or just the LV, which is what I needed. 
From bryan at kadzban.is-a-geek.net Thu Jan 24 12:20:31 2008
From: bryan at kadzban.is-a-geek.net (Bryan Kadzban)
Date: Thu, 24 Jan 2008 07:20:31 -0500
Subject: forced fsck (again?)
In-Reply-To: <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com>
References: <200801221701.50202.giancarlo.corti@supsi.ch> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com>
Message-ID: <4798828F.3030303@kadzban.is-a-geek.net>

Damian Menscher wrote:
> I also have a newbie question: does the fsck of a snapshot really
> catch everything that might be wrong with the drive, or are there
> other failure modes that only a real fsck would catch?

AFAIK, it catches everything. The LVM2 snapshot is effectively a copy
of the FS at the time the snapshot was taken. Of course, that could be
wrong, but I don't believe so...

From adilger at sun.com Thu Jan 24 15:19:23 2008
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 24 Jan 2008 08:19:23 -0700
Subject: forced fsck (again?)
In-Reply-To: <4798828F.3030303@kadzban.is-a-geek.net> References: <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com> <4798828F.3030303@kadzban.is-a-geek.net> Message-ID: <20080124151923.GI18433@webber.adilger.int> On Jan 24, 2008 07:20 -0500, Bryan Kadzban wrote: > Damian Menscher wrote: > > At the risk of adding complexity, what about having the SNAPSIZE be > > automatically determined? Most users would have no idea what to set > > it to, and we should be able to guess some reasonable values. For > > example, the fsck time can probably be estimated by looking at the > > number of inodes, how full the filesystem is, etc. Alternatively, we > > could just allocate all available space in the LVM. Yes, this is what my script does, basically guess at a size (1/500th of the LV size, limited by the amount of free space in the VG). It should be possible to override this in a .conf file, but it should be possible for the majority of systems to run with the defaults. > > I also have a newbie question: does the fsck of a snapshot really > > catch everything that might be wrong with the drive, or are there > > other failure modes that only a real fsck would catch? > > AFAIK, it catches everything. The LVM2 snapshot is effectively a copy > of the FS at the time the snapshot was taken. Yes, it should catch everything. The snapshot process forces the filesystem to flush everything to disk in a consistent manner, as if it were unmounted cleanly and a full copy of the device was made. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. 
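The sizing rule Andreas describes could look like the sketch below, in integer megabytes. The 1/500 ratio and the free-space cap come from this thread; the 32 MB floor is an invented placeholder for the "minimum size for things like the journal":

```shell
# Guess a snapshot size: 1/500th of the LV, clamped between a fixed
# floor and the free space remaining in the VG.
snapshot_size_mb() {
    lv_size_mb=$1
    vg_free_mb=$2
    size=$(( lv_size_mb / 500 ))
    # floor: placeholder for a journal-sized minimum (assumed value)
    [ "$size" -lt 32 ] && size=32
    # ceiling: never ask for more than the VG has free
    [ "$size" -gt "$vg_free_mb" ] && size=$vg_free_mb
    echo "$size"
}
```

The inputs would come from "lvs -o lv_size" and "vgs -o vg_free" with --units m --nosuffix, as discussed later in the thread; any of the constants here should be overridable from the .conf file.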
From bryan at kadzban.is-a-geek.net Fri Jan 25 03:20:04 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Thu, 24 Jan 2008 22:20:04 -0500 Subject: forced fsck (again?) In-Reply-To: <47988245.4010904@kadzban.is-a-geek.net> References: <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> Message-ID: <47995564.2050402@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Bryan Kadzban wrote: > Maybe just using "all available space in the VG" is a better idea > anyway. That's what I did here, at least for now. There's a place in here where the available space in the VG can be checked, but I'm not sure how to get that value out of lvs (or vgs) in a format that's easy to parse, so I skipped it for now as well. (I could only get values like "250m", which I assume means 250 megs, but how is the script supposed to handle the suffixes?) > I suppose it would work to leave the check interval set in the > superblock, and avoid using fsck.* -f; that way each fsck would be > able to determine if it should do a full check or not. Turns out that will *not* work. fsck.* without -f will succeed even if it doesn't check anything (or at least, e2fsck will). So every day, the last-check day will get bumped, even though nothing actually got checked. That defeats the purpose here. 
I've split out the operations of checking the FS, setting the last-check time to now, setting the last-check time to some time in the ancient past (if the check fails -- this forces the next-reboot check to be a full one), and getting the last-check time, into their own functions. Each one takes a device name and filesystem type argument, and splits execution paths depending on the FS type. Adding support for a new FS (e.g. better support for reiser) should be as easy as modifying the case statements in four functions. > It would probably be possible; I'll see what I can find out later > today. I have a QEMU VM set up whose root FS is on LVM... Well, it was set up. I seem to have somehow nuked the md-raid layer, so the LVM stuff isn't available anymore. (It involved a qemu bug (the VM was running, and suddenly died); then when starting it back up, the md-raid code started a "background rebuild", and ended up locking up qemu. I'll probably have to start over with a new set of image files.) > We'd still need to find the FS type, although I believe udev provides > some programs that may be helpful (if we want to rely on them being > installed). volume_id, in particular, should provide that info. I'm running /lib/udev/vol_id here to get the FS type. I'm not sure if that's the best solution or not, but it does work (at least for now). Anyway, I've also renamed the script from e2check to lvcheck (since it works for more than ext* now). Same changelog entry as before, though. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHmVVjS5vET1Wea5wRA6sLAJ472TUX1amJroWIxdGbqQqlLZrS2QCeLHAA z/fhwCISV3krc/coAmfWlVw= =5gFW -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: lvcheck.conf URL: From jack at suse.cz Fri Jan 25 16:09:31 2008 From: jack at suse.cz (Jan Kara) Date: Fri, 25 Jan 2008 17:09:31 +0100 Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66) In-Reply-To: <20080114131454.37eb7c12@think.oraclecorp.com> References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> <477BF72B.4000608@oracle.com> <20080114170609.GH4214@duck.suse.cz> <20080114131454.37eb7c12@think.oraclecorp.com> Message-ID: <20080125160931.GC1767@duck.suse.cz> On Mon 14-01-08 13:14:54, Chris Mason wrote: > On Mon, 14 Jan 2008 18:06:09 +0100 > Jan Kara wrote: > > On Wed 02-01-08 12:42:19, Zach Brown wrote: > > > Erez Zadok wrote: > > > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's > > > > latest tree. Kernel w/ SMP, preemption, and lockdep configured. > > > > > > This is a real lock ordering problem. Thanks for reporting it. > > > > > > The updating of atime inside sys_mmap() orders the mmap_sem in the > > > vfs outside of the journal handle in ext3's inode dirtying: > > > > > [ lock inversion traces ] > > > > Two fixes come to mind: > > > > > > 1) use something like Peter's ->mmap_prepare() to update atime > > > before acquiring the mmap_sem. > > > ( http://lkml.org/lkml/2007/11/11/97 ). I don't know if this would > > > leave more paths which do a journal_start() while holding the > > > mmap_sem. > > > > > > 2) rework ext3's dio to only hold the jbd handle in > > > ext3_get_block(). Chris has a patch for this kicking around > > > somewhere but I'm told it has problems exposing old blocks in > > > ordered data mode. > > > > > > Does anyone have preferences? I could go either way. I certainly > > > don't like the idea of journal handles being held across the > > > entirety of fs/direct-io.c. It's yet another case of O_DIRECT > > > differing wildly from the buffered path :(. 
> > I've looked more into it and I think that 2) is the only way to go
> > since transaction start ranks below page lock (standard buffered
> > write path) and page lock ranks below mmap_sem. So we have at least
> > one more dependency mmap_sem must go before transaction start...
>
> Just to clarify a little bit:
>
> If ext3's DIO code only touches transactions in get_block, then it can
> violate data=ordered rules. Basically the transaction that allocates
> the blocks might commit before the DIO code gets around to writing them.
>
> A crash in the wrong place will expose stale data on disk.

Hmm, I've looked at it and I don't think so - look at the rationale in
the patch below... That patch should fix the lock-inversion problem (at
least I see no lockdep warnings on my test machine).

								Honza
--
Jan Kara
SUSE Labs, CR

---

We cannot start transaction in ext3_direct_IO() and just let it last
during the whole write because dio_get_page() acquires mmap_sem which
ranks above transaction start (e.g. because we have dependency chain
mmap_sem->PageLock->journal_start, or because we update atime while
holding mmap_sem) and thus deadlocks could happen. We solve the problem
by starting a transaction separately for each ext3_get_block() call.
We *could* have a problem that we allocate a block and before its data
are written out the machine crashes and thus we expose stale data. But
that does not happen because for hole-filling generic code falls back
to buffered writes and for file extension, we add inode to orphan list
and thus in case of crash, journal replay will truncate inode back to
the original size.

Signed-off-by: Jan Kara

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..5ab7c57 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -941,55 +941,45 @@ out:
 	return err;
 }
 
-#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
+/* Maximum number of blocks we map for direct IO at once. */
+#define DIO_MAX_BLOCKS 4096
+/*
+ * Number of credits we need for writing DIO_MAX_BLOCKS:
+ * We need sb + group descriptor + bitmap + inode -> 4
+ * For B blocks with A block pointers per block we need:
+ * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
+ * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
+ */
+#define DIO_CREDITS 25
 
 static int ext3_get_block(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	int ret = 0;
+	int ret = 0, started = 0;
 	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 
-	if (!create)
-		goto get_block;		/* A read */
-
-	if (max_blocks == 1)
-		goto get_block;		/* A single block get */
-
-	if (handle->h_transaction->t_state == T_LOCKED) {
-		/*
-		 * Huge direct-io writes can hold off commits for long
-		 * periods of time.  Let this commit run.
-		 */
-		ext3_journal_stop(handle);
-		handle = ext3_journal_start(inode, DIO_CREDITS);
-		if (IS_ERR(handle))
+	if (create && !handle) {	/* Direct IO write... */
+		if (max_blocks > DIO_MAX_BLOCKS)
+			max_blocks = DIO_MAX_BLOCKS;
+		handle = ext3_journal_start(inode, DIO_CREDITS +
+				2 * EXT3_QUOTA_TRANS_BLOCKS(sb));
+		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
-		goto get_block;
-	}
-
-	if (handle->h_buffer_credits <= EXT3_RESERVE_TRANS_BLOCKS) {
-		/*
-		 * Getting low on buffer credits...
-		 */
-		ret = ext3_journal_extend(handle, DIO_CREDITS);
-		if (ret > 0) {
-			/*
-			 * Couldn't extend the transaction.  Start a new one.
-			 */
-			ret = ext3_journal_restart(handle, DIO_CREDITS);
+			goto out;
 		}
+		started = 1;
 	}
 
-get_block:
-	if (ret == 0) {
-		ret = ext3_get_blocks_handle(handle, inode, iblock,
+	ret = ext3_get_blocks_handle(handle, inode, iblock,
 					max_blocks, bh_result, create, 0);
-		if (ret > 0) {
-			bh_result->b_size = (ret << inode->i_blkbits);
-			ret = 0;
-		}
+	if (ret > 0) {
+		bh_result->b_size = (ret << inode->i_blkbits);
+		ret = 0;
 	}
+	if (started)
+		ext3_journal_stop(handle);
+out:
 	return ret;
 }
@@ -1680,7 +1670,8 @@ static int ext3_releasepage(struct page *page, gfp_t wait)
  * if the machine crashes during the write.
  *
  * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file.
+ * crashes then stale disk data _may_ be exposed inside the file. But current
+ * VFS code falls back into buffered path in that case so we are safe.
  */
 static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 			const struct iovec *iov, loff_t offset,
@@ -1689,7 +1680,7 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	struct ext3_inode_info *ei = EXT3_I(inode);
-	handle_t *handle = NULL;
+	handle_t *handle;
 	ssize_t ret;
 	int orphan = 0;
 	size_t count = iov_length(iov, nr_segs);
@@ -1697,17 +1688,21 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 	if (rw == WRITE) {
 		loff_t final_size = offset + count;
 
-		handle = ext3_journal_start(inode, DIO_CREDITS);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
 		if (final_size > inode->i_size) {
+			/* Credits for sb + inode write */
+			handle = ext3_journal_start(inode, 2);
+			if (IS_ERR(handle)) {
+				ret = PTR_ERR(handle);
+				goto out;
+			}
 			ret = ext3_orphan_add(handle, inode);
-			if (ret)
-				goto out_stop;
+			if (ret) {
+				ext3_journal_stop(handle);
+				goto out;
+			}
 			orphan = 1;
 			ei->i_disksize = inode->i_size;
+			ext3_journal_stop(handle);
 		}
 	}
@@ -1715,18 +1710,21 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 					offset, nr_segs,
 					ext3_get_block, NULL);
 
-	/*
-	 * Reacquire the handle: ext3_get_block() can restart the transaction
-	 */
-	handle = ext3_journal_current_handle();
-
-out_stop:
-	if (handle) {
+	if (orphan) {
 		int err;
 
-		if (orphan && inode->i_nlink)
+		/* Credits for sb + inode write */
+		handle = ext3_journal_start(inode, 2);
+		if (IS_ERR(handle)) {
+			/* This is really bad luck. We've written the data
+			 * but cannot extend i_size. Bail out and pretend
+			 * the write failed... */
+			ret = PTR_ERR(handle);
+			goto out;
+		}
+		if (inode->i_nlink)
 			ext3_orphan_del(handle, inode);
-		if (orphan && ret > 0) {
+		if (ret > 0) {
 			loff_t end = offset + ret;
 			if (end > inode->i_size) {
 				ei->i_disksize = end;

From adilger at sun.com Fri Jan 25 00:36:05 2008
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 24 Jan 2008 17:36:05 -0700
Subject: forced fsck (again?)
In-Reply-To: <47988245.4010904@kadzban.is-a-geek.net>
References: <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net>
Message-ID: <20080125003605.GP18433@webber.adilger.int>

On Jan 24, 2008 07:19 -0500, Bryan Kadzban wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Andreas Dilger wrote:
> > The problem with this is that ext2/3/4, along with most other
> > filesystems will fail to mount if passed an unknown mount option.
>
> Uh oh. Yeah, that's a problem.
>
> I was under the impression that all the tools would ignore unknown
> options -- if that's not the case, then we probably need to come up with
> something else. Automatically determining the snapshot size sounds like
> a good idea, but I'm not sure how to do it.
(I'm not sure what decides > the snapshot size that you need -- it looks like the number of changes > that you're going to make to the snapshot, or maybe the number of > changes that you're going to make to both the snapshot and the real LV? Since we aren't making any changes to the LV it is only the changes that are made to the original volume that consume space in the volume. > In either case, I'm not sure how to find that out. Maybe just using > "all available space in the VG" is a better idea anyway.) I made a wild guess of 1/500 of the total volume size. Making the snapshot size a linear function of the volume size makes sense, because the fsck time is generally linear with the volume size, and the amount of change in the original volume (and hence the space needed in the snapshot) is also a linear function of how long the fsck runs. Having a minimum size for things like the journal, and a maximum size of the free space in the VG definitely makes sense. Another thing worth checking in the script is if there is an existing snapshot volume (maybe left over if the script was interrupted by a crash) and delete it before recreating the volume. It also makes sense to have a very clear name like "{lvname}.fsck.temporary.20080124" that can be easily seen by the user as not very useful, and can also be deleted by the script safely. > True, but what about determining whether it has to run at all (based on > the last-check time)? Although, I suppose it would work to leave the > check interval set in the superblock, and avoid using fsck.* -f; that > way each fsck would be able to determine if it should do a full check or > not. I would just run the script from cron.weekly instead of every night. If we miss the check for a few days this isn't harmful, and better than annoying users. > Or maybe rewriting in C would work; then I could just use getmntent. 
> Although I'm not exactly a fan of writing something like this in C, > either; shell is more powerful, except for this "reading fstab" thing. No, I'd rather have a shell script... Less long-term maintenance. > > But I've come to think that /etc/fstab is the wrong thing to use for > > input. This script is only useful for LVM volumes, so getting a list > > of LVs is more appropriate I think. > > True, except the no-LVs behavior of lvscan, lvs, and any of the other > tools that I was looking at yesterday is decidedly non-optimal. What is the problem there? My simple test showed "lvs" on a system w/o LVM reports "No volume groups found" to stderr, and that can easily be ignored. > We'd still need to find the FS type, although I believe udev provides > some programs that may be helpful (if we want to rely on them being > installed). volume_id, in particular, should provide that info. If it's part of e2fsprogs, then using "blkid" is much better, since it is also part of e2fsprogs. export `blkid -s TYPE $FS | cut -d' ' -f2` will set an environment variable TYPE={fstype}. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Fri Jan 25 08:55:57 2008 From: adilger at sun.com (Andreas Dilger) Date: Fri, 25 Jan 2008 01:55:57 -0700 Subject: forced fsck (again?) In-Reply-To: <47995564.2050402@kadzban.is-a-geek.net> References: <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> Message-ID: <20080125085557.GV18433@webber.adilger.int> On Jan 24, 2008 22:20 -0500, Bryan Kadzban wrote: > # Run this from cron each night. 
If the machine is on AC power, it > # will run the checks; otherwise they will all be skipped. (If the > # script can't tell whether the machine is on AC power, a setting in > # the configuration file (/etc/lvcheck.conf) decides whether it will > # continue with the checks, or abort.) Probably once a week is enough, and "/etc/cron.weekly" (anacron) exists on most systems and will ensure that if the system was off for more than a week it will still be run on the next boot. > # Any LV that passes fsck will have its last-check time updated (in > # the real superblock, not the snapshot's superblock); any LV whose > # fsck fails will send an email notification to a configurable user > # ($EMAIL). This $EMAIL setting is optional, but its use is highly > # recommended, since if any LV fails, it will need to be checked > # manually, offline. I would recommend also using "logger" to log something in /var/log/messages. > # attempt to force a check of $1 on the next reboot > function try_force_check() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 16000 -T "19000101" "$dev" > ;; > reiserfs) > # ??? > echo "Don't know how to set the last-check time on reiserfs..." >&2 > ;; > *) > echo "Don't know how to set the last-check time on $fstype..." >&2 > ;; > esac > } These error messages are incorrect, namely "set the last-check time" should be replaced with "force a check". Since there isn't any reason to special case reiserfs here, you may as well remove it. I suspect that a nice email to the XFS and JFS folks would get them to add some mechanism to force a filesystem check on the next reboot. > # check the FS on $1 passively, printing output to $3. > function perform_check() { > case "$fstype" in > ext2|ext3) > # the only point in fixing anything is just to see if fsck can. 
> nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" && > nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev" Hmm, I'm not sure I understand what it is you want to do? The fsck should be run as 'e2fsck -fn "$dev"' (since we already know this is ext2|ext3). Using "-C 0" isn't useful because we don't want progress in the output log, and "-p" without "-f" will just check the superblock. We don't want to be fixing anything (since this should be a read-only snapshot) so "-fy" is also not so great. > # do everything needed to check and reset dates and counters on /dev/$1/$2. > function check_fs() { > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > trap "rm $tmpfile ; trap - RETURN" RETURN For the log file it probably makes sense to keep this around with a timestamp if there is a failure. That means it is fine to generate a random filename temporarily, but it should be renamed to something meaningful (e.g. /var/log/lvfsck.$dev.$(date +%Y%m%d) or similar). > # only one check happens at a time; using all the free space in the VG > # at least won't prevent other checks from happening... > lvcreate -s -l "100%FREE" -n "${lv}-snap" "${vg}/${lv}" To find free space, use "vgs -o vg_size --noheadings ${vg}", and the LV size can be had from "lvs -o lv_size --noheadings ${vg}/${lv}". You can strip the size suffixes with "--units M --nosuffix" to get units of MB. Also good to create a more unique name than "${lv}-snap", since that might conflict with an existing snapshot, and if the script crashes the user might be wondering if that LV using 100% of the free space is safe to delete or not. Please also add XFS support here, having it call "xfs_check", since fsck.xfs is an empty shell... For JFS it can also use "fsck.jfs -fn $dev" to check the filesystem. > if perform_check "/dev/${vg}/${lv}-snap" "${fstype}" "${tmpfile}" ; then > echo 'Background scrubbing succeeded!' > try_delay_checks "/dev/${vg}/${lv}" "$fstype" > else > echo 'Background scrubbing failed! 
Reboot to fsck soon!' Printing the device name in these messages, and sending them to the syslog via logger would probably be more useful. > try_force_check "/dev/${vg}/${lv}" "$fstype" > > if test -n "$EMAIL"; then > mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < $tmpfile > fi > > set -e Have you verified that the script doesn't exit if an fsck fails with an error? > # pull in configuration -- don't bother with a parser, just use the shell's > . /etc/lvcheck.conf You should check that this file exists before sourcing it, or the script will exit with an error: [ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf > # parse up lvscan output > lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \ > while read DEV ; do > # remove the single quotes around the device name > DEV="`echo "$DEV" | tr -d \'`" > > # get the FS type > FSTYPE="`/lib/udev/vol_id -t "$DEV"`" Please use "blkid", since that is part of e2fsprogs already and avoids an extra dependency. > # if the date is unknown, run fsck every day. sigh. Better to write "run fsck each time the script is run". > # get the free space > SPACE="`lvs --noheadings -o vg_free "$DEV"`" > > # ensure that some free space exists, at least > # ??? -- can lvs print vg_free in plain numbers, or do I have to > # figure out what a suffix of "m" means? skip the check for now. "vgs", and --nosuffix, per above. > #!/bin/sh > > # e2check configuration variables: > # > # EMAIL > # Address to send failure notifications to. If empty, > # failure notifications will not be sent. > # > # INTERVAL > # Days to wait between checks. All LVs use the same > # INTERVAL, but the "days since last check" value can > # be different per LV, since that value is stored in > # the ext2/ext3 superblock. > # > # AC_UNKNOWN > # Whether to run the e2fsck checks if the script can't > # determine whether the machine is on AC power. Laptop > # users will want to set this to ABORT, while server and > # desktop users will probably want to set this to > # CONTINUE. 
Those are the only two valid values. > > EMAIL='root' > INTERVAL=30 > AC_UNKNOWN="ABORT" I would also make these all be defaults in the script (before this file is parsed), so it works as expected if /etc/lvscan.conf doesn't exist. I'd also recommend that the default for AC_UNKNOWN be CONTINUE (or possibly leave it unset by default and have the script not error out in this case, so that the script does something useful for the majority of users. If we are worried about the laptop case, we could add checks to see if the system has a PC card, since very few desktop systems have them. Both the commands "pccardctl info" and "cardctl info" produce no output on stdout if there is no PC card slot, and this could be used to decide between "CONTINUE" for desktops and "ABORT" for laptops. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From bryan at kadzban.is-a-geek.net Sat Jan 26 02:02:56 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Fri, 25 Jan 2008 21:02:56 -0500 Subject: forced fsck (again?) In-Reply-To: <20080125085557.GV18433@webber.adilger.int> References: <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> Message-ID: <479A94D0.9080308@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 24, 2008 22:20 -0500, Bryan Kadzban wrote: >> # Run this from cron each night. > > Probably once a week is enough, and "/etc/cron.weekly" (anacron) exists > on most systems and will ensure that if the system was off for more than > a week it will still be run on the next boot. 
Yeah, it's probably true that once per week is enough. Do you think it would still make sense to try and parse out the last-check time from the LV if this gets run each week, or just unconditionally check everything (if on AC)? Checking everything weekly might be too often (especially if the extra disk usage ends up exposing bad bits on a disk), but maybe not. > I would recommend also using "logger" to log something in /var/log/messages. Yeah, that makes sense. logger is part of util-linux{,-ng}, so that's not a huge extra dependency either. >> echo "Don't know how to set the last-check time on $fstype..." >&2 > > These error messages are incorrect, namely "set the last-check time" should > be replaced with "force a check". That's true. I was trying to get the errors to refer to what specific information needed to be added to the script (in this case, it needs to know how to set the last-check time), but "force a check" is probably safer anyway. Setting the last-check time may not be the method that every FS uses. > Since there isn't any reason to special > case reiserfs here, you may as well remove it. That's what I get for deciding to handle reiser separately everywhere, and then changing my mind later -- I forgot to go back and remove this case. Oops... :-) > I suspect that a nice email to the XFS and JFS folks would get them to add > some mechanism to force a filesystem check on the next reboot. Is the issue that those FSes don't have any such mechanism today, or is it just that I don't know how to do this on them? (I'll have to go look up the XFS/JFS lists, too, but that's not terribly difficult.) >> nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" && >> nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev" > > Hmm, I'm not sure I understand what it is you want to do? 
Well, neither do I, necessarily -- those arguments were copied from the
initial script that I hacked the extra stuff into (the one that Ted
posted at the start of this whole thing). :-)

I see that your script just uses -fn; that's probably simpler anyway.
What it doesn't determine is whether fsck would be able to automatically
repair the damage that it finds; I guess the question is whether this
condition should be treated as a fsck failure (requiring a reboot to
fix) or not. It probably depends on the severity of the fixes that fsck
makes...

OTOH, if you give e2fsck the -fy option, and it does make changes, its
exit status will not be zero, so it will already be treated as a failure
by this script. So the only difference is that -fn stops it from writing
to the snapshot just to have the writes thrown away; that's probably
actually good.

> and "-p" without "-f" will just check the superblock.

Yeah, I think the idea was to check the superblock first, and then check
the rest of the FS. But I think -fn is probably more explicit about what
we want fsck to do, too. (Plus, even if we do take a read-write snapshot
with LVM2, there's no point in taking up extra space by writing to the
snapshot itself, if it's just going to get thrown away.)

> For the log file it probably makes sense to keep this around with a
> timestamp if there is a failure.

And let e.g. logrotate get rid of older versions; yeah, that makes sense.

> To find free space, use "vgs -o vg_size --noheadings ${vg}", and the
> LV size can be had from "lvs -o lv_size --noheadings ${vg}/${lv}".

Free space can also be retrieved with -o vg_free, but yeah.

> You can strip the size suffixes with "--units M --nosuffix" to get
> units of MB.

Ah, that was the bit I was missing yesterday (further down in the
script): --nosuffix. Thanks! I also just got your message from yesterday
about the reasoning behind the snapshot-size guess (based on the
frequency of writes to the main LV); that makes sense.
And since I can get the size out of lvs, that makes that much easier, too, so I'll just use 1/500th the LV size. > Also good to create a more unique name than "${lv}-snap", since that > might conflict with an existing snapshot, and if the script crashes > the user might be wondering if that LV using 100% of the free space is > safe to delete or not. Yeah, that was left over from the original script as well. Changing it makes sense. > Please also add XFS support here, Done, I think. I assume xfs_check doesn't need any args? (Should fsck.xfs perhaps just exec xfs_check and pass it all the args? That's a whole separate discussion, probably.) > For JFS it can also use "fsck.jfs -fn $dev" to check the filesystem. Done. >> echo 'Background scrubbing succeeded!' >> echo 'Background scrubbing failed! Reboot to fsck soon!' > > Printing the device name in these messages, and sending them to the syslog > via logger would probably be more useful. True; done. The severity may need a bit of tweaking, but hopefully not much. >> set -e > > Have you verified that the script doesn't exit if an fsck fails with an > error? No, the script exits if fsck fails with an error. That's obviously bad - -- I wasn't thinking that far ahead when I added that. It's gone now. >> . /etc/lvcheck.conf > > You should check that this file exists before sourcing it, or the script will > exit with an error That was intended; I figured the config file would be required (back when I first added it). But since we have decent default values for the settings in it, it probably makes sense to make it optional now. >> FSTYPE="`/lib/udev/vol_id -t "$DEV"`" > > Please use "blkid", since that is part of e2fsprogs already and avoids > an extra dependency. True. Looking at the manpages, it appears that vol_id does some extra checks to try to detect RAID members as RAID members, instead of partitions containing a filesystem. 
But that would only affect this script if someone had multiple LVs RAIDed together, and I doubt that's well-supported elsewhere, so blkid is fine. >> # if the date is unknown, run fsck every day. sigh. > > Better to write "run fsck each time the script is run". Yeah, that makes more sense. >> # ??? -- can lvs print vg_free in plain numbers, or do I have to >> # figure out what a suffix of "m" means? skip the check for now. > > "vgs", and --nosuffix, per above. Yep, done. >> EMAIL='root' >> INTERVAL=30 >> AC_UNKNOWN="ABORT" > > I would also make these all be defaults in the script (before this file is > parsed), so it works as expected if /etc/lvscan.conf doesn't exist. Since it's now optional, yes, that makes sense. > I'd also recommend that the default for AC_UNKNOWN be CONTINUE (or possibly > leave it unset by default and have the script not error out in this case, > so that the script does something useful for the majority of users. Well, it depends on whether the majority of users have laptops, or some other hardware type (desktops, servers, etc.). I was thinking that laptops would be more prevalent, but since this is Linux, it's probably actually servers. OK -- CONTINUE it is, by default. > If we are worried about the laptop case, we could add checks to see > if the system has a PC card, since very few desktop systems have them. > Both the commands "pccardctl info" and "cardctl info" produce no output > on stdout if there is no PC card slot, and this could be used to decide > between "CONTINUE" for desktops and "ABORT" for laptops. Or stuff it into comments in the config file. Pushing the decision back onto the user makes me a bit uncomfortable, but fuzzy decisions (ones that aren't necessarily based on the right info) make me even less comfortable. Hmm. And depending how the power_supply sysfs class ends up working, maybe this is all a moot point anyway: if it always has devices under it on >=2.6.24, then the setting won't even matter. 
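The sysfs-based AC detection being discussed might look like the sketch below. The helper name `on_ac` and the sysfs-root parameter are assumptions for illustration (the parameter exists purely so the probe can be exercised against a mock directory tree; a real script would default it to /sys/class/power_supply):

```shell
#!/bin/sh
# Sketch of the /sys/class/power_supply probe discussed above.  Returns 0
# if any non-battery supply reports online, 1 if all known supplies are
# offline, and 0 ("assume AC", matching the CONTINUE default) when the
# state can't be determined at all.
on_ac() {
    root="${1:-/sys/class/power_supply}"
    any_known=no
    [ -d "$root" ] || return 0
    for psu in "$root"/*; do
        [ -r "$psu/type" ] || continue
        # batteries don't tell us whether mains power is present
        [ "$(cat "$psu/type")" = "Battery" ] && continue
        online=$(cat "$psu/online" 2>/dev/null)
        [ "$online" = "1" ] && return 0
        [ "$online" = "0" ] && any_known=yes
    done
    [ "$any_known" = "yes" ] && return 1
    return 0
}
```

If the kernel always populates this class on >=2.6.24, as speculated above, the AC_UNKNOWN fallback would indeed rarely trigger.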
For now, I'll just leave the default CONTINUE, but with comments in the config file aimed at laptop users. - ---- Create a script to transparently run fsck in the background on any active LVM logical volumes, as long as the machine is on AC power, and that LV has been last checked more than a configurable number of days ago. Also create an optional configuration file to set various options in the script. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHmpTOS5vET1Wea5wRA2XXAKCZzt9SEOSBVs4EkrI4gt3Ztl0v5wCg3gq5 1ChmnEccT+hFVo/2B/RpU8U= =D4HV -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck.conf URL: From tytso at mit.edu Sat Jan 26 04:33:34 2008 From: tytso at mit.edu (Theodore Tso) Date: Fri, 25 Jan 2008 23:33:34 -0500 Subject: forced fsck (again?) In-Reply-To: <20080125085557.GV18433@webber.adilger.int> References: <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> Message-ID: <20080126043334.GB28889@mit.edu> On Fri, Jan 25, 2008 at 01:55:57AM -0700, Andreas Dilger wrote: > > nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" && > > nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev" > > Hmm, I'm not sure I understand what it is you want to do? The fsck should > be run as 'e2fsck -fn "$dev"' (since we already know this is ext2|ext3). 
> Using "-C 0" isn't useful because we don't want progress in the output log, This was my fault. It means that when you run this from a tty, you get to see the progress bar. The -s flag to logsave will strip out the progress information. (I added logsave -s precisely for this purpose. :-) > and "-p" without "-f" will just check the superblock. That's needed; e2fsck -p will clean up the orphaned inode list, so that the subsequent e2fsck -fy will return 0 if the filesystem is clean. Without the e2fsck -p, e2fsck -fy will return 1 (because it modified the filesystem), which we can't distinguish from the case where the filesystem had errors. > We don't want to be > fixing anything (since this should be a read-only snapshot) so "-fy" is > also not so great. This is a tradeoff. e2fsck -fy requires that the snapshot have more space (although if you run out, it's not that horrible; the snapshot will just go invalid). The advantage of "-fy" is that you get more information about any errors in the filesystem, whereas "-fn" may not report as useful information. > > # do everything needed to check and reset dates and counters on /dev/$1/$2. > > function check_fs() { > > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > > trap "rm $tmpfile ; trap - RETURN" RETURN > > For the log file it probably makes sense to keep this around with a > timestamp if there is a failure. That means it is fine to generate a > random filename temporarily, but it should be renamed to something > meaningful (e.g. /var/log/lvfsck.$dev.$(date +%Y%m%d) or similar). The idea is if there is a failure we'll e-mail to the administrator; after that, there's no real need to keep it around. - Ted From bryan at kadzban.is-a-geek.net Tue Jan 29 00:56:50 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Mon, 28 Jan 2008 19:56:50 -0500 Subject: forced fsck (again?)
In-Reply-To: <20080128174804.GT18433@webber.adilger.int> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> <20080128174804.GT18433@webber.adilger.int> Message-ID: <479E79D2.5070406@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 25, 2008 21:02 -0500, Bryan Kadzban wrote: >> logger $arg -p user."$sev" -- "$msg" > > This should use "-t lvcheck" so that it reports what program is generating > the message. Yep, that'd be useful. >> tune2fs -C 16000 -T "19000101" "$dev" > > I'm a tiny bit reluctant to overwrite the "last checked" date, since this > might be useful information for the administrator (i.e. it will tell the > interval wherein the corruption was detected). Setting the "mount count" > is enough to force a check, and the mount count itself can be reverse > engineered from "reboot" messages in the "last" log. Assuming the user doesn't set a maximum mount count higher than 16000 (but I think that's highly unlikely). I think the benefit of being able to know (approximately) when corruption started is probably worth it, though. > It is a lot clearer if the "cases" (ext2|ext3|ext4) are aligned with the > "case" statement, I see what you mean. The script just uses vim's default autoindent levels, but I can change the cases. >> reiserfs) >> # do nothing? > > I thought you were going to remove the empty reiserfs cases? Er, I was; I think I was looking at the wrong case last time around. This one's gone now as well. >> local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > > Shouldn't be "e2fsck.log"? Maybe "lvcheck.log.XXXXXXXXX"? 
Yeah, that'd be better; that's more leftover code from the original script. >> # Assume the script won't run more than one instance at a time? >> lvremove -f "${lvtemp##/dev}" > > Should check the error return and bail out of script if there is an error. Will that catch the "more than one instance at a time" case (e.g. if another script run is still running e2fsck on this snapshot)? Assuming lvremove can fail (and it probably can), it's probably a good idea to check it in any case, but if running e2fsck makes lvremove fail (until e2fsck finishes), that's a decent way to get rid of the comment too. Also, I think it'd be better to skip just the current FS, rather than an "exit 1" type bail-out, right? > MINFREE=0 # megabytes to leave free in each volume group > MINSNAP=256 # megabytes for minimum snapshot size. I've added something very similar to this logic, but I changed the checks around a bit. I think it makes more sense this way (doing the overall space check first, and then the limits second), unless this logic disallows some valid combinations? (Still trying to decide how to handle logging *fsck output, and what to do with the file, based on your other message...) - ----- Create a script to transparently run fsck in the background on any active LVM logical volumes, as long as the machine is on AC power, and that LV has been last checked more than a configurable number of days ago. Also create an optional configuration file to set various options in the script. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHnnnRS5vET1Wea5wRAw0iAJ9wcLyfBSaH5FSIJNH0YakzDCUvjwCgnJEH lPScP39vBYIIjOQPiftgDs8= =XjFF -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: lvcheck.conf URL: From sandeen at redhat.com Tue Jan 29 02:42:11 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 28 Jan 2008 20:42:11 -0600 Subject: forced fsck (again?) In-Reply-To: <479E79D2.5070406@kadzban.is-a-geek.net> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> <20080128174804.GT18433@webber.adilger.int> <479E79D2.5070406@kadzban.is-a-geek.net> Message-ID: <479E9283.5000001@redhat.com> Some hints for xfs, which does not enforce check intervals, so: - no mechanism or need to delay next check - no mechanism to enforce check on next boot; just notify w/ email - no mechanism to read last-checked; just check on acceptable cron interval Also, you really want to use xfs_repair -n instead of xfs_check; it's much faster and more memory-efficient. So most of the xfs) cases are just documenting that xfs can't and/or doesn't need to do anything, they don't really need to be there - up to you. :) -Eric --- lvcheck.orig 2008-01-28 20:23:16.000000000 -0600 +++ lvcheck 2008-01-28 20:40:25.000000000 -0600 @@ -111,6 +111,9 @@ ext2|ext3) tune2fs -C 16000 "$dev" ;; + xfs) + # XFS does not enforce check intervals; let email suffice. + ;; *) log "warning" "Don't know how to force a check on $fstype..." ;; @@ -126,6 +129,9 @@ ext2|ext3) tune2fs -C 0 -T now "$dev" ;; + xfs) + # XFS does not enforce check intervals; nothing to delay + ;; *) log "warning" "Don't know how to delay checks on $fstype..." 
;; @@ -143,6 +149,10 @@ dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \ sed -e 's/Last checked:[[:space:]]*//' ;; + xfs) + # XFS does not save last-checked; just check on cron interval + echo "Unknown" + ;; *) # TODO: add support for various FSes here echo "Unknown" @@ -167,7 +177,7 @@ return 0 ;; xfs) - nice logsave -as "${tmpfile}" xfs_check "$dev" + nice logsave -as "${tmpfile}" xfs_repair -n "$dev" return $? ;; jfs) From sandeen at redhat.com Tue Jan 29 03:39:26 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 28 Jan 2008 21:39:26 -0600 Subject: forced fsck (again?) In-Reply-To: <479749A1.5040208@redhat.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080123091601.GZ3180@webber.adilger.int> <479749A1.5040208@redhat.com> Message-ID: <479E9FEE.4020506@redhat.com> Eric Sandeen wrote: > Andreas Dilger wrote: > >> Maybe some of the distro folks (Eric? :-) will pick up on this thread and >> consider adding the "e2fsck snapshot" script to cron.monthly or similar. > > I'm watching.... sure, that might be a candidate for Fedora. Ideally > it'd be part of e2fsprogs Er, I guess it really doesn't need to be in e2fsprogs, does it, since it's extending to cover other fs's; it could stand on its own, or maybe even be part of the init infrastructure. I'll ask the folks who own init; otherwise we could package it up on its own. -Eric From adilger at sun.com Mon Jan 28 17:48:04 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 28 Jan 2008 10:48:04 -0700 Subject: forced fsck (again?) 
In-Reply-To: <479A94D0.9080308@kadzban.is-a-geek.net> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> Message-ID: <20080128174804.GT18433@webber.adilger.int> On Jan 25, 2008 21:02 -0500, Bryan Kadzban wrote: > > I suspect that a nice email to the XFS and JFS folks would get them to add > > some mechanism to force a filesystem check on the next reboot. > > Is the issue that those FSes don't have any such mechanism today, or is > it just that I don't know how to do this on them? I don't think they have any such mechanism (at least not one that I know about), but I think they will find it useful to add. > (Should fsck.xfs perhaps just exec xfs_check and pass it all the args? > That's a whole separate discussion, probably.) Right... > Create a script to transparently run fsck in the background on any > active LVM logical volumes, as long as the machine is on AC power, and > that LV has been last checked more than a configurable number of days > ago. Also create an optional configuration file to set various options > in the script. > > Signed-Off-By: Bryan Kadzban > #!/bin/sh > # > # lvcheck > > # send $2 to syslog, with severity $1 > # severities are emerg/alert/crit/err/warning/notice/info/debug > function log() { > local sev="$1" > local msg="$2" > local arg= > > # log warning-or-higher messages to stderr as well > [ "$sev" == "emerg" || "$sev" == "alert" || "$sev" == "crit" || \ > "$sev" == "err" || "$sev" == "warning" ] && arg=-s > > logger $arg -p user."$sev" -- "$msg" > } This should use "-t lvcheck" so that it reports what program is generating the message. 
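A sketch of the log() helper with the suggested "-t lvcheck" tag. One caveat: the posted script's severity test, `[ "$sev" == "emerg" || ... ]`, is not valid test(1) syntax (`||` cannot appear inside single brackets); a case statement is a portable fix. The sev_arg helper below is a hypothetical name, split out only so the severity mapping can be checked on its own:

```shell
#!/bin/sh
# Map a syslog severity to the extra logger flag: warning-or-higher
# messages are also mirrored to stderr via -s.
sev_arg() {
    case "$1" in
        emerg|alert|crit|err|warning) echo "-s" ;;
        *) echo "" ;;
    esac
}

# send $2 to syslog with severity $1; -t tags each entry with the
# generating program's name, as suggested above
log() {
    logger -t lvcheck $(sev_arg "$1") -p "user.$1" -- "$2"
}
```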
> # attempt to force a check of $1 on the next reboot > function try_force_check() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 16000 -T "19000101" "$dev" I'm a tiny bit reluctant to overwrite the "last checked" date, since this might be useful information for the administrator (i.e. it will tell the interval wherein the corruption was detected). Setting the "mount count" is enough to force a check, and the mount count itself can be reverse engineered from "reboot" messages in the "last" log. > # attempt to set the last-check time on $1 to now, and the mount count to 0. > function try_delay_checks() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) It is a lot clearer if the "cases" (ext2|ext3|ext4) are aligned with the "case" statement, like below, since that provides a better separation: case "$fstype" in ext2|ext3|ext4) tune2fs -C 0 -T now "$dev" ;; > reiserfs) > # do nothing? ;; I thought you were going to remove the empty reiserfs cases? > # check the FS on $1 passively, saving output to $3. > function perform_check() { > local dev="$1" > local fstype="$2" > local tmpfile="$3" > > case "$fstype" in > ext2|ext3) Ditto on indenting the cases. > # do everything needed to check and reset dates and counters on /dev/$1/$2. > function check_fs() { > local vg="$1" > local lv="$2" > local fstype="$3" > local snapsize="$4" > > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` Shouldn't be "e2fsck.log"? Maybe "lvcheck.log.XXXXXXXXX"? > local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`" > local snaplvbase="${lv}-lvcheck-temp" > local snaplv="${snaplvbase}-`date +'%Y%m%d'`" > > # clean up any left-over snapshot LVs > for lvtemp in /dev/${vg}/${snaplvbase}* ; do > if [ -e "$lvtemp" ] ; then > # Assume the script won't run more than one instance at a time? > lvremove -f "${lvtemp##/dev}" Should check the error return and bail out of script if there is an error. 
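The requested error check on that cleanup loop might look like this sketch (hypothetical helper name; the full /dev/<vg>/<base> prefix is passed in as a parameter). A failed lvremove -- e.g. a snapshot still held open by a running fsck -- makes the function return nonzero, so the caller can skip just this LV rather than exiting the whole script:

```shell
#!/bin/sh
# Sketch of the stale-snapshot cleanup with an error check added.  On any
# lvremove failure we log to stderr and return 1 so the caller can skip
# this LV; left-over snapshots then double as a crude lock against a
# second concurrent check of the same LV.
cleanup_stale() {
    prefix="$1"
    for lvtemp in "$prefix"*; do
        [ -e "$lvtemp" ] || continue
        if ! lvremove -f "$lvtemp"; then
            echo "Could not delete stale snapshot $lvtemp" >&2
            return 1
        fi
    done
    return 0
}

# intended usage inside the per-LV loop:
#   cleanup_stale "/dev/${vg}/${lv}-lvcheck-temp" || continue
```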
> # parse up lvscan output > lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \ > while read DEV ; do > > if [ "$SNAPSIZE" -gt "$SPACE" ] ; then > log "err" "Can't take a snapshot of $DEV: not enough free space in the VG." > continue Well, the 1/500 rule is only a guideline. For example, I have a huge filesystem for TV shows, but it doesn't change that often, so it would make more sense to just reduce $SNAPSIZE to $SPACE (assuming some minimum amount of free space is available). Make a default, that is settable in the .conf file: MINFREE=0 # megabytes to leave free in each volume group MINSNAP=256 # megabytes for minimum snapshot size. # make snapshot large enough to handle e.g. journal and other updates [ $SNAPSIZE -lt $MINSNAP ] && SNAPSIZE=$MINSNAP # limit snapshot to available space [ $SNAPSIZE -gt $((SPACE - MINFREE)) ] && SNAPSIZE=$((SPACE - MINFREE)) # if we don't have enough space, skip this check if [ $SNAPSIZE -lt $MINSNAP ]; then log "warning" "Check of $LV can't get ${SNAPSIZE}MB, skipping" continue fi Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Mon Jan 28 20:59:19 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 28 Jan 2008 13:59:19 -0700 Subject: Integrating patches in SLES10 e2fsprogs In-Reply-To: <479E08D5.3040609@redhat.com> References: <20080124211728.GA24900@webber.adilger.int> <20080127050543.GC24842@mit.edu> <20080128153802.GB17752@mit.edu> <479E08D5.3040609@redhat.com> Message-ID: <20080128205919.GW18433@webber.adilger.int> On Jan 28, 2008 10:54 -0600, Eric Sandeen wrote: > Theodore Tso wrote: > > On Mon, Jan 28, 2008 at 04:26:53PM +0100, Matthias Koenig wrote: > >>> Patch6: e2fsprogs-mdraid.patch > >>> > >>> This apparently adds a new environment variable, > >>> BLKID_SKIP_CHECK_MDRAID, which forces blkid to not detect mdraid > >>> devices. I'm not sure why. 
> >> Workaround for people having stale RAID signature on their disk: > >> https://bugzilla.novell.com/show_bug.cgi?id=100530 > > > > Hmm... there's got to be a better way around this. > > Won't help existing block devices, but it'd be nice to have a common > library which could be called @ mkfs time to wipe out all known > signatures... > > mkfs.xfs tries to do this, but it'd be silly to duplicate in every mkfs. Well, blkid already has a way to _detect_ a lot of filesystem signatures, so it might be relatively easy to exploit the type_array[] entries to have it zap out all of these blocks. That said, the majority of them are in the first 68kB of the filesystem (mdraid excluded) so it shouldn't be too hard to zero them out. Let's hope nobody ever uses "0x00000000" as magic. mke2fs already tries to do this, though I notice: - the zap_sector() call will skip the entire write if there is a BSD bootblock, instead of skipping only the first sector(s) and overwriting the rest... Since I don't know much about BSD bootblocks, I don't know what the right behaviour is, but I can guess we still want to zero out 4-68kB (or whatever). - it only overwrites up to sector 8 (4kB) and not further into the disk to catch e.g. reiserfs superblocks. Usually it will overwrite this anyways (GDT, bitmaps, inode table), but in some rare cases it might not. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Mon Jan 28 17:52:16 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 28 Jan 2008 10:52:16 -0700 Subject: forced fsck (again?) 
In-Reply-To: <20080126043334.GB28889@mit.edu> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <20080126043334.GB28889@mit.edu> Message-ID: <20080128175216.GU18433@webber.adilger.int> On Jan 25, 2008 23:33 -0500, Theodore Tso wrote: > > Hmm, I'm not sure I understand what it is you want to do? The fsck should > > be run as 'e2fsck -fn "$dev"' (since we already know this is ext2|ext3). > > Using "-C 0" isn't useful because we don't want progress in the output log, > > This was my fault. It means that when you run this from a tty, you > get to see the progress bar. The -s flag to logsave will strip out > the progress information. (I added logsave -s precisely for this > purpose. :-) OK, that is fine too, I wasn't sure if it would fill the log with "===". > > and "-p" without "-f" will just check the superblock. > > That's needed e2fsck -p will clean up the orphaned inode list, so that > the subsequent e2fsck -fy will return 0 if the filesystem is clean. > Without the the fsck -p, then e2fsck -fy will return 1 (because it > modified the filesystem) which we can't distinguish from the case > where the filesystem had errors. Hmm, shouldn't that be cleaned up when making a snapshot? If not, then we are stuck with the problem that you have to have writable snapshots, and that is less desirable than read-only snapshots, but not fatal I guess. > > We don't want to be fixing anything (since this should be a read-only > > snapshot) so "-fy" is also not so great. > > This is a tradeoff. e2fsck -fy requires that the snapshot have more > space (although if you run off, it's not that horrible; the snapshot > will just go invalid). 
Well, in my one experiment this caused the lvcheck to be unkillable, and also marked the parent offline... Maybe it was just that one time (I haven't tested extensively). > The advantage of "-fy" is that you get more > information about any errors in the filesystem, where as "-fn" may not > report as useful information. True. > > > # do everything needed to check and reset dates and counters on /dev/$1/$2. > > > function check_fs() { > > > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > > > trap "rm $tmpfile ; trap - RETURN" RETURN > > > > For the log file it probably makes sense to keep this around with a > > timestamp if there is a failure. That means it is fine to generate a > > random filename temporarily, but it should be renamed to something > > meaningful (e.g. /var/log/lvfsck.$dev.$(date +%Y%m%d) or similar). > > The idea is if there is a failure we'll e-mail to the administrator; > after that, there's no real need to keep it around. Unless email is broken, for whatever reason. I suppose it might make sense to keep a single log for each device (put the timestamp inside the log) so that the space usage doesn't increase dramatically. Having logrotate do cleanup isn't so great. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Tue Jan 29 23:56:27 2008 From: adilger at sun.com (Andreas Dilger) Date: Tue, 29 Jan 2008 16:56:27 -0700 Subject: forced fsck (again?) 
In-Reply-To: <479E79D2.5070406@kadzban.is-a-geek.net> References: <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> <20080128174804.GT18433@webber.adilger.int> <479E79D2.5070406@kadzban.is-a-geek.net> Message-ID: <20080129235627.GB23836@webber.adilger.int> On Jan 28, 2008 19:56 -0500, Bryan Kadzban wrote: > >> # Assume the script won't run more than one instance at a time? > >> lvremove -f "${lvtemp##/dev}" > > > > Should check the error return and bail out of script if there is an error. > > Will that catch the "more than one instance at a time" case (e.g. if > another script run is still running e2fsck on this snapshot)? Assuming > lvremove can fail (and it probably can), it's probably a good idea to > check it in any case, but if running e2fsck makes lvremove fail (until > e2fsck finishes), that's a decent way to get rid of the comment too. > > Also, I think it'd be better to skip just the current FS, rather than an > "exit 1" type bail-out, right? It's a hard call... In some sense if there is an error we may leave a string of LVs around that are filling up the VG, but the presence of the LV (and hopefully being unable to remove it while e2fsck is running) also serves as a "locking" mechanism in case some e2fsck takes a very long time to run. I guess as long as we print something in the syslog, and the LV remains in place with a suitably clear "this isn't very useful" name, then eventually the user will notice it and delete it. > - ----- > > Create a script to transparently run fsck in the background on any > active LVM logical volumes, as long as the machine is on AC power, and > that LV has been last checked more than a configurable number of days > ago. 
Also create an optional configuration file to set various options > in the script. > > Signed-Off-By: Bryan Kadzban You can add a Signed-Off-By: Andreas Dilger here, as it does everything I think is needed at this point... Probably good to put a version number in the script, along with your name/email so it is clear what version a user is running. > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.7 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHnnnRS5vET1Wea5wRAw0iAJ9wcLyfBSaH5FSIJNH0YakzDCUvjwCgnJEH > lPScP39vBYIIjOQPiftgDs8= > =XjFF > -----END PGP SIGNATURE----- > #!/bin/sh > # > # lvcheck > > # Released under the GNU General Public License, either version 2 or > # (at your option) any later version. > > # Overview: > # > # Run this from cron periodically (e.g. once per week). If the > # machine is on AC power, it will run the checks; otherwise they will > # all be skipped. (If the script can't tell whether the machine is > # on AC power, it will use a setting in the configuration file > # (/etc/lvcheck.conf) to decide whether to continue with the checks, > # or abort.) > # > # The script will then decide which logical volumes are active, and > # can therefore be checked via an LVM snapshot. Each of these LVs > # will be queried to find its last-check day, and if that was more > # than $INTERVAL days ago (where INTERVAL is set in the configuration > # file as well), or if the last-check day can't be determined, then > # the script will take an LVM snapshot of that LV and run fsck on the > # snapshot. The snapshot will be set to use 1/500 the space of the > # source LV. After fsck finishes, the snapshot is destroyed. > # (Snapshots are checked serially.) > # > # Any LV that passes fsck should have its last-check time updated (in > # the real superblock, not the snapshot's superblock); any LV whose > # fsck fails will send an email notification to a configurable user > # ($EMAIL). 
This $EMAIL setting is optional, but its use is highly > # recommended, since if any LV fails, it will need to be checked > # manually, offline. Relevant messages are also sent to syslog. > > # Set default values for configuration params. Changes to these values > # will be overwritten on an upgrade! To change these values, use > # /etc/lvcheck.conf. > EMAIL='root' > INTERVAL=30 > AC_UNKNOWN="CONTINUE" > MINSNAP=256 > MINFREE=0 > > # send $2 to syslog, with severity $1 > # severities are emerg/alert/crit/err/warning/notice/info/debug > function log() { > local sev="$1" > local msg="$2" > local arg= > > # log warning-or-higher messages to stderr as well > [ "$sev" == "emerg" || "$sev" == "alert" || "$sev" == "crit" || \ > "$sev" == "err" || "$sev" == "warning" ] && arg=-s > > logger -t lvcheck $arg -p user."$sev" -- "$msg" > } > > # determine whether the machine is on AC power > function on_ac_power() { > local any_known=no > > # try sysfs power class first > if [ -d /sys/class/power_supply ] ; then > for psu in /sys/class/power_supply/* ; do > if [ -r "${psu}/type" ] ; then > type="`cat "${psu}/type"`" > > # ignore batteries > [ "${type}" = "Battery" ] && continue > > online="`cat "${psu}/online"`" > > [ "${online}" = 1 ] && return 0 > [ "${online}" = 0 ] && any_known=yes > fi > done > > [ "${any_known}" = "yes" ] && return 1 > fi > > # else fall back to AC adapters in /proc > if [ -d /proc/acpi/ac_adapter ] ; then > for ac in /proc/acpi/ac_adapter/* ; do > if [ -r "${ac}/state" ] ; then > grep -q on-line "${ac}/state" && return 0 > grep -q off-line "${ac}/state" && any_known=yes > elif [ -r "${ac}/status" ] ; then > grep -q on-line "${ac}/status" && return 0 > grep -q off-line "${ac}/status" && any_known=yes > fi > done > > [ "${any_known}" = "yes" ] && return 1 > fi > > if [ "$AC_UNKNOWN" == "CONTINUE" ] ; then > return 0 # assume on AC power > elif [ "$AC_UNKNOWN" == "ABORT" ] ; then > return 1 # assume on battery > else > log "err" "Invalid value for AC_UNKNOWN 
in the config file" > exit 1 > fi > } > > # attempt to force a check of $1 on the next reboot > function try_force_check() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 16000 "$dev" > ;; > *) > log "warning" "Don't know how to force a check on $fstype..." > ;; > esac > } > > # attempt to set the last-check time on $1 to now, and the mount count to 0. > function try_delay_checks() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 0 -T now "$dev" > ;; > *) > log "warning" "Don't know how to delay checks on $fstype..." > ;; > esac > } > > # print the date that $1 was last checked, in a format that date(1) will > # accept, or "Unknown" if we don't know how to find that date. > function try_get_check_date() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \ > sed -e 's/Last checked:[[:space:]]*//' > ;; > *) > # TODO: add support for various FSes here > echo "Unknown" > ;; > esac > } > > # check the FS on $1 passively, saving output to $3. > function perform_check() { > local dev="$1" > local fstype="$2" > local tmpfile="$3" > > case "$fstype" in > ext2|ext3) > nice logsave -as "${tmpfile}" e2fsck -fn "$dev" > return $? > ;; > reiserfs) > echo Yes | nice logsave -as "${tmpfile}" fsck.reiserfs --check "$dev" > # apparently can't fail? let's hope not... > return 0 > ;; > xfs) > nice logsave -as "${tmpfile}" xfs_check "$dev" > return $? > ;; > jfs) > nice logsave -as "${tmpfile}" fsck.jfs -fn "$dev" > return $? > ;; > *) > log "warning" "Don't know how to check $fstype filesystems passively: assuming OK." > ;; > esac > } > > # do everything needed to check and reset dates and counters on /dev/$1/$2. 
> function check_fs() {
>     local vg="$1"
>     local lv="$2"
>     local fstype="$3"
>     local snapsize="$4"
>
>     local tmpfile=`mktemp -t lvcheck.log.XXXXXXXXXX`
>     local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`"
>     local snaplvbase="${lv}-lvcheck-temp"
>     local snaplv="${snaplvbase}-`date +'%Y%m%d'`"
>
>     # clean up any left-over snapshot LVs
>     for lvtemp in /dev/${vg}/${snaplvbase}* ; do
>         if [ -e "$lvtemp" ] ; then
>             # Assume the script won't run more than one instance at a time?
>
>             log "warning" "Found stale snapshot $lvtemp: attempting to remove."
>
>             if ! lvremove -f "${lvtemp##/dev}" ; then
>                 log "err" "Could not delete stale snapshot $lvtemp"
>                 return 1
>             fi
>         fi
>     done
>
>     # and create this one (snapsize is in megabytes)
>     lvcreate -s -L "${snapsize}M" -n "${snaplv}" "${vg}/${lv}"
>
>     if perform_check "/dev/${vg}/${snaplv}" "${fstype}" "${tmpfile}" ; then
>         log "info" "Background scrubbing of /dev/${vg}/${lv} succeeded."
>         try_delay_checks "/dev/${vg}/${lv}" "$fstype"
>     else
>         log "err" "Background scrubbing of /dev/${vg}/${lv} failed: run fsck offline soon!"
>         try_force_check "/dev/${vg}/${lv}" "$fstype"
>
>         if test -n "$EMAIL"; then
>             mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < $tmpfile
>         fi
>
>         # save the log file in /var/log in case mail is disabled
>         mv "$tmpfile" "$errlog"
>     fi
>
>     rm -f "$tmpfile"
>     lvremove -f "${vg}/${snaplv}"
> }
>
> # pull in configuration -- overwrite the defaults above if the file exists
> [ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf
>
> # check whether the machine is on AC power: if not, skip fsck
> on_ac_power || exit 0
>
> # parse up lvscan output
> lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
> while read DEV ; do
>     # remove the single quotes around the device name
>     DEV="`echo "$DEV" | tr -d \'`"
>
>     # get the FS type: blkid prints TYPE="blah"
>     eval `blkid -s TYPE "$DEV" | cut -d' ' -f2`
>
>     # get the last-check time
>     check_date=`try_get_check_date "$DEV" "$TYPE"`
>
>     # if the date is unknown, run fsck every time the script runs.  sigh.
>     if [ "$check_date" != "Unknown" ] ; then
>         # add $INTERVAL days, and throw away the time portion
>         check_day=`date --date="$check_date $INTERVAL days" +'%Y%m%d'`
>
>         # get today's date, and skip the check if it's not within the interval
>         today=`date +'%Y%m%d'`
>         [ $check_day -gt $today ] && continue
>     fi
>
>     # get the volume group and logical volume names
>     VG="`lvs --noheadings -o vg_name "$DEV"`"
>     LV="`lvs --noheadings -o lv_name "$DEV"`"
>
>     # get the free space and LV size (in megs), guess at the snapshot
>     # size, and see how much the admin will let us use (keeping MINFREE
>     # available)
>     SPACE="`lvs --noheadings --units M --nosuffix -o vg_free "$DEV"`"
>     SIZE="`lvs --noheadings --units M --nosuffix -o lv_size "$DEV"`"
>     # drop any fractional part: expr can only do integer arithmetic
>     SPACE="${SPACE%%.*}"
>     SIZE="${SIZE%%.*}"
>     SNAPSIZE="`expr "$SIZE" / 500`"
>     AVAIL="`expr "$SPACE" - "$MINFREE"`"
>
>     # if we don't even have MINSNAP space available, skip the LV
>     if [ "$MINSNAP" -gt "$AVAIL" -o "$AVAIL" -le 0 ] ; then
>         log "warning" "Not enough free space on volume group for ${DEV}; skipping"
>         continue
>     fi
>
>     # make snapshot large enough to handle e.g. journal and other updates
>     [ "$SNAPSIZE" -lt "$MINSNAP" ] && SNAPSIZE="$MINSNAP"
>
>     # limit snapshot to available space (VG space minus min-free)
>     [ "$SNAPSIZE" -gt "$AVAIL" ] && SNAPSIZE="$AVAIL"
>
>     # don't need to check SNAPSIZE again: MINSNAP <= AVAIL, MINSNAP <= SNAPSIZE,
>     # and SNAPSIZE <= AVAIL, combined, means SNAPSIZE must be between MINSNAP
>     # and AVAIL, which is what we need -- assuming AVAIL > 0
>
>     # check it
>     check_fs "$VG" "$LV" "$TYPE" "$SNAPSIZE"
> done
>
> #!/bin/sh
>
> # e2check configuration file

Minor note - "lvcheck configuration file".

> # This file follows the pattern of sshd_config: default
> # values are shown here, commented-out.
>
> # EMAIL
> #    Address to send failure notifications to.  If empty,
> #    failure notifications will not be sent.
>
> #EMAIL='root'
>
> # INTERVAL
> #    Days to wait between checks.
> #    All LVs use the same
> #    INTERVAL, but the "days since last check" value can
> #    be different per LV, since that value is stored in
> #    the filesystem superblock.
>
> #INTERVAL=30
>
> # AC_UNKNOWN
> #    Whether to run the e2fsck checks if the script can't
> #    determine whether the machine is on AC power.  Laptop
> #    users will want to set this to ABORT, while server and
> #    desktop users will probably want to set this to
> #    CONTINUE.  Those are the only two valid values.
>
> #AC_UNKNOWN="CONTINUE"
>
> # MINSNAP
> #    Minimum snapshot size to take, in megabytes.  The
> #    default snapshot size is 1/500 the size of the logical
> #    volume, but if that size is less than MINSNAP, the
> #    script will use MINSNAP instead.  This should be large
> #    enough to handle e.g. journal updates, and other disk
> #    changes that require (semi-)constant space.
>
> #MINSNAP=256
>
> # MINFREE
> #    Minimum amount of space (in megabytes) to keep free in
> #    each volume group when creating snapshots.
>
> #MINFREE=0

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From gregt at maths.otago.ac.nz  Wed Jan 30 02:01:58 2008
From: gregt at maths.otago.ac.nz (Greg Trounson)
Date: Wed, 30 Jan 2008 15:01:58 +1300
Subject: forced fsck (again?)
In-Reply-To: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
References: <200801221701.50202.giancarlo.corti@supsi.ch>
	<4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com>
	<70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
Message-ID: <479FDA96.5080209@maths.otago.ac.nz>

Valerie Henson wrote:
...
> This will be ironic coming from me, but I think the ext3 defaults for
> forcing a file system check are a little too conservative for many
> modern use cases.  The two cases I have in mind in particular are:
>
> * Servers with long uptimes that need very low data unavailability.
> Imagine you have a machine room full of servers that have all
> been up and running happily for more than 180 days - the preferred
> case.  Now imagine that the room overheats and the emergency power cut
> is tripped.  Standard heat reduction is swiftly applied (i.e., open
> the door and turn on a fan and hope security doesn't notice) and the
> power turned back on.  Now your entire machine room will be fscking
> for the next 3 hours and whatever service they provide will be
> completely unavailable.  Of course, any admin worth their salt will
> turn off force fsck so it only runs during controlled downtime...
> won't they?

Agreed.  This is a real problem.  And controlled downtime is rather
difficult if it takes several hours to complete.  You're either without
whatever services they provide or with reduced redundancy for that time.

> * Laptops.  If suspend and resume doesn't work on your laptop, you'll
> be rebooting (and remounting) a lot, perhaps several times a day.  The
> preferred solution is to get Matthew Garrett to fix your laptop, but
> if you can't, fscking every 10-30 days seems a little excessive.
> Desktop users who shutdown daily to save power will have similar
> problems.  Distros often have the "don't fsck on battery" option and
> some don't use the ext3 defaults for mkfs, but that's only a partial
> solution.  In this case, it's definitely a little much to ask a random
> laptop user to tune their file system.

Agreed again.  Having a laptop insist on an fsck when about to give a
presentation to a room full of professors is really not a good look.
And being flimsier and more abused than desktops, laptops IMO really do
need regular checking.

> I'm not sure what the best solution is ...

I am.  Since fscks are unacceptably inconvenient and apparently the only
thing worse than enforcing periodic fscks is *not* enforcing periodic
fscks, then we only have one option.  Make fscks less inconvenient.

And since we apparently can't make them any faster, the only way I can
think of to do that is to add support for (you know what I'm going to
say): Online fscks.

We really, *really* need to support checking of mounted read/write file
systems.  I would envisage a read-only fsck done on all mounted
filesystems regularly, which wouldn't do any damage to a file system if
implemented properly.  If an inconsistency is picked up, then recommend
an offline one to be scheduled when the user/admin is ready.

Greg

From chris.mason at oracle.com  Mon Jan 14 18:16:48 2008
From: chris.mason at oracle.com (Chris Mason)
Date: Mon, 14 Jan 2008 18:16:48 -0000
Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
In-Reply-To: <20080114170609.GH4214@duck.suse.cz>
References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu>
	<477BF72B.4000608@oracle.com> <20080114170609.GH4214@duck.suse.cz>
Message-ID: <20080114131454.37eb7c12@think.oraclecorp.com>

On Mon, 14 Jan 2008 18:06:09 +0100 Jan Kara wrote:

> On Wed 02-01-08 12:42:19, Zach Brown wrote:
> > Erez Zadok wrote:
> > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's
> > > latest tree.  Kernel w/ SMP, preemption, and lockdep configured.
> >
> > This is a real lock ordering problem.  Thanks for reporting it.
> >
> > The updating of atime inside sys_mmap() orders the mmap_sem in the
> > vfs outside of the journal handle in ext3's inode dirtying:
>
> [ lock inversion traces ]
>
> > Two fixes come to mind:
> >
> > 1) use something like Peter's ->mmap_prepare() to update atime
> > before acquiring the mmap_sem.
> > ( http://lkml.org/lkml/2007/11/11/97 ).  I don't know if this would
> > leave more paths which do a journal_start() while holding the
> > mmap_sem.
> >
> > 2) rework ext3's dio to only hold the jbd handle in
> > ext3_get_block().  Chris has a patch for this kicking around
> > somewhere but I'm told it has problems exposing old blocks in
> > ordered data mode.
> >
> > Does anyone have preferences?
> > I could go either way.  I certainly
> > don't like the idea of journal handles being held across the
> > entirety of fs/direct-io.c.  It's yet another case of O_DIRECT
> > differing wildly from the buffered path :(.
>
> I've looked more into it and I think that 2) is the only way to go
> since transaction start ranks below page lock (standard buffered
> write path) and page lock ranks below mmap_sem.  So we have at least
> one more dependency: mmap_sem must go before transaction start...

Just to clarify a little bit:

If ext3's DIO code only touches transactions in get_block, then it can
violate data=ordered rules.  Basically the transaction that allocates
the blocks might commit before the DIO code gets around to writing them.

A crash in the wrong place will expose stale data on disk.

-chris

From menscher at gmail.com  Thu Jan 24 08:24:19 2008
From: menscher at gmail.com (Damian Menscher)
Date: Thu, 24 Jan 2008 00:24:19 -0800
Subject: forced fsck (again?)
In-Reply-To: <4797F397.9020306@kadzban.is-a-geek.net>
References: <200801221701.50202.giancarlo.corti@supsi.ch>
	<70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
	<20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net>
	<20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net>
	<20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu>
	<20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net>
Message-ID: <1d8411e00801240024yf31af33tb202e0bef44b5ec9@mail.gmail.com>

At the risk of adding complexity, what about having the SNAPSIZE be
automatically determined?  Most users would have no idea what to set it
to, and we should be able to guess some reasonable values.  For example,
the fsck time can probably be estimated by looking at the number of
inodes, how full the filesystem is, etc.  Alternatively, we could just
allocate all available space in the LVM.
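[Editorial note: the sizing policy already quoted earlier in the thread is just arithmetic, so an auto-determined SNAPSIZE can be sketched without touching LVM at all. The helper below is hypothetical (not from the thread); it assumes whole-megabyte inputs and reuses the 1/500 heuristic plus the MINSNAP/MINFREE clamps from the lvcheck script.]

```shell
# Hypothetical sketch: pick a snapshot size (MB) for an LV, given its
# size and the free space in its volume group.  Mirrors the 1/500
# heuristic and MINSNAP/MINFREE clamps of the lvcheck script above.
# Assumes whole-megabyte integers (strip decimals from lvs output first).
pick_snapsize() {
    lv_size="$1"          # size of the LV to check, in MB
    vg_free="$2"          # free space in its VG, in MB
    minsnap="${3:-256}"   # smallest useful snapshot
    minfree="${4:-0}"     # space the admin wants left untouched

    avail=$((vg_free - minfree))

    # not even MINSNAP available: caller should skip this LV
    [ "$avail" -le 0 ] && return 1
    [ "$minsnap" -gt "$avail" ] && return 1

    snap=$((lv_size / 500))
    [ "$snap" -lt "$minsnap" ] && snap="$minsnap"   # floor at MINSNAP
    [ "$snap" -gt "$avail" ] && snap="$avail"       # cap at what the VG can spare

    echo "$snap"
}
```

Something like `SNAPSIZE=$(pick_snapsize "$SIZE" "$SPACE" "$MINSNAP" "$MINFREE") || continue` would then replace the expr/clamp block in the main loop.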
I also have a newbie question: does the fsck of a snapshot really catch
everything that might be wrong with the drive, or are there other
failure modes that only a real fsck would catch?  I'm wondering if it's
still a good idea to do an occasional full fsck.

Damian

2008/1/23 Bryan Kadzban:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: RIPEMD160
>
> Andreas Dilger wrote:
> > On Jan 23, 2008  09:08 -0500, Theodore Tso wrote:
> >> (We could sneak some of that information into the options field of
> >> fstab, since the kernel and other programs that parse that field
> >> just take what they need and ignore the rest, but.... ick, ick,
> >> ick. :-)
> >
> > I agree - adding email to fstab is icky and I wouldn't go there.  I
> > don't see a problem with just emailing it to "root@" by default and
> > giving the user the option to change it to something else.
>
> Since the email address is not per-filesystem, it's fine by me to put it
> into a config file somewhere.  Forcing the interval to be global is
> probably also OK, although I wouldn't want to be forced to set the
> snapshot size globally.  I do think that fstab is the best place for
> per-filesystem options, though.
>
> But it's not too difficult to parse out a custom SNAPSIZE option, and
> even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is
> present on any LV, if the script is going to parse fstab anyway.  (Or
> should the option's name be lowercase?  Either will work.)
>
> >> Also, I could imagine that a user might not want to check all of
> >> the filesystems in fstab.
> >
> > Similarly, a config file which disables checking on some LV if
> > specified seems reasonable.
>
> That does seem reasonable, but I haven't done it in the script that's
> attached.  Maybe support for a SKIP (or skip, or e2check_skip, or
> skip_e2check, or whatever) option in fstab's options field?
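[Editorial note: matching such a flag in the comma-separated options field is a one-line case pattern. The sketch below is hypothetical; the option name "e2check_skip" is only one of the candidates floated above, nothing has been agreed on.]

```shell
# Hypothetical sketch: succeed if the fstab options field ($1) contains
# a skip flag.  Wrapping both the field and the flag in commas avoids
# false matches on longer option names (e.g. "e2check_skipme").
has_skip_option() {
    case ",$1," in
        *,e2check_skip,*) return 0 ;;
        *)                return 1 ;;
    esac
}

# In the fstab-parsing loop this would become:
#     has_skip_option "$OPTIONS" && continue
```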
>
> Regarding the idea of having this support multiple filesystems -- that's
> a good idea, I think, but the current script is highly specific to ext2
> or ext3.  Use of tune2fs (to reset the last-check time) and dumpe2fs (to
> find the last-check time), in particular, will be problematic on other
> FSes.  I haven't done that in this script, though it may be possible.
>
> Anyway, here's a second version.  I've changed it to parse up fstab,
> and added an option for what to do if AC status can't be determined.
> Kernel-style changelog entry, etc., below:
>
> - -------
>
> Create a script to transparently run e2fsck in the background on any LVM
> logical volumes listed in /etc/fstab, as long as the machine is on AC
> power, and that LV has been last checked more than a configurable number
> of days ago.  Also create a configuration file to set various options in
> the script.
>
> Signed-Off-By: Bryan Kadzban
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHl/OXS5vET1Wea5wRA/UaAJwIE27W6qasI7Gm/uvZm/pY1rcBtwCcDXYq
> cc3qE/uOEqm4ksYHlI6+IJU=
> =7Lf3
> -----END PGP SIGNATURE-----
>
> #!/bin/sh
> #
> # e2check
>
> # Released under the GNU General Public License, either version 2 or
> # (at your option) any later version.
>
> # Overview:
> #
> # Run this from cron each night.  If the machine is on AC power, it
> # will run the checks; otherwise they will all be skipped.  (If the
> # script can't tell whether the machine is on AC power, a setting in
> # the configuration file (/etc/e2check.conf) decides whether it will
> # continue or abort.)
> #
> # The script will then decide which filesystems in /etc/fstab are on
> # logical volumes, and can therefore be checked via an LVM snapshot.
> # Each of these filesystems will be queried to find its last check
> # day, and if that was more than $INTERVAL days ago (where INTERVAL
> # is set in the configuration file as well), then the script will
> # take an LVM snapshot of the filesystem and run e2fsck on the
> # snapshot.  The snapshot's size can be set via either the SNAPSIZE
> # option in the options field in /etc/fstab, or the DEFAULT_SNAPSIZE
> # option in /etc/e2check.conf -- but make sure it's set large enough.
> # After e2fsck finishes, the snapshot is destroyed.
> #
> # Any filesystem that passes e2fsck will have its last-check time
> # updated (in the real superblock, not the snapshot); any filesystem
> # that fails will send an email notification to a configurable user
> # ($EMAIL).  This $EMAIL setting is optional, but its use is highly
> # recommended, since if any filesystem fails, it will need to be
> # checked manually offline.
>
> function on_ac_power() {
>     local any_known=no
>
>     # try sysfs power class first
>     if [ -d /sys/class/power_supply ] ; then
>         for psu in /sys/class/power_supply/* ; do
>             if [ -r "${psu}/type" ] ; then
>                 type="`cat "${psu}/type"`"
>
>                 # ignore batteries
>                 [ "${type}" = "Battery" ] && continue
>
>                 online="`cat "${psu}/online"`"
>
>                 [ "${online}" = 1 ] && return 0
>                 [ "${online}" = 0 ] && any_known=yes
>             fi
>         done
>
>         [ "${any_known}" = "yes" ] && return 1
>     fi
>
>     # else fall back to AC adapters in /proc
>     if [ -d /proc/acpi/ac_adapter ] ; then
>         for ac in /proc/acpi/ac_adapter/* ; do
>             if [ -r "${ac}/state" ] ; then
>                 grep -q on-line "${ac}/state" && return 0
>                 grep -q off-line "${ac}/state" && any_known=yes
>             elif [ -r "${ac}/status" ] ; then
>                 grep -q on-line "${ac}/status" && return 0
>                 grep -q off-line "${ac}/status" && any_known=yes
>             fi
>         done
>
>         [ "${any_known}" = "yes" ] && return 1
>     fi
>
>     if [ "$AC_UNKNOWN" == "CONTINUE" ] ; then
>         return 0    # assume on AC power
>     elif [ "$AC_UNKNOWN" == "ABORT" ] ; then
>         return 1    # assume on battery
>     else
>         echo "Invalid value for AC_UNKNOWN in the config file" >&2
>         exit 1
>     fi
> }
>
> function check_fs() {
>     local vg="$1"
>     local lv="$2"
>     local opts="$3"
>     local snapsize="${DEFAULT_SNAPSIZE}"
>
>     case "$opts" in
>     *SNAPSIZE=*)
>         # parse out just the SNAPSIZE option's value
>         snapsize="${opts##*SNAPSIZE=}"
>         snapsize="${snapsize%%,*}"
>         ;;
>     esac  # else leave it at DEFAULT_SNAPSIZE
>
>     [ -z "$snapsize" ] && return 1
>
>     local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX`
>     trap "rm $tmpfile ; trap - RETURN" RETURN
>
>     local start="$(date +'%Y%m%d%H%M%S')"
>
>     lvcreate -s -L "${snapsize}" -n "${lv}-snap" "${vg}/${lv}"
>
>     if nice logsave -as $tmpfile e2fsck -p -C 0 "/dev/${vg}/${lv}-snap" && \
>        nice logsave -as $tmpfile e2fsck -fy -C 0 "/dev/${vg}/${lv}-snap" ; then
>         echo 'Background scrubbing succeeded!'
>         tune2fs -C 0 -T "${start}" "/dev/${vg}/${lv}"
>     else
>         echo 'Background scrubbing failed!  Reboot to fsck soon!'
>         tune2fs -C 16000 -T "19000101" "/dev/${vg}/${lv}"
>
>         if test -n "$EMAIL"; then
>             mail -s "E2fsck of /dev/${vg}/${lv} failed!" $EMAIL < $tmpfile
>         fi
>     fi
>
>     lvremove -f "${vg}/${lv}-snap"
> }
>
> set -e
>
> # pull in configuration -- don't bother with a parser, just use the shell's
> . /etc/e2check.conf
>
> # check whether the machine is on AC power: if not, skip the e2fsck
> on_ac_power || exit 0
>
> # parse up fstab
> grep -v '^#' /etc/fstab | grep -v '^$' | awk '$6!=0 {print $1,$3,$4;}' | \
> while read FS FSTYPE OPTIONS ; do
>     # Use of tune2fs in check_fs, and dumpe2fs below, means we can
>     # only handle ext2/ext3 FSes
>     [ "$FSTYPE" != "ext3" -a "$FSTYPE" != "ext2" ] && continue
>
>     # get the volume group (or an error message)
>     VG="`lvs --noheadings -o vg_name "$FS" 2>&1`"
>
>     # skip non-LVM devices (hopefully LVM VGs don't have spaces)
>     [ "`echo "$VG" | awk '{print NF;}'`" -ne 1 ] && continue
>
>     # get the logical volume name
>     LV="`lvs --noheadings -o lv_name "$FS"`"
>
>     # get the last check time plus $INTERVAL days
>     check_date=`dumpe2fs -h "/dev/${VG}/${LV}" 2>/dev/null | grep 'Last checked:' | \
>         sed -e 's/Last checked:[[:space:]]*//'`
>     check_day=`date --date="${check_date} $INTERVAL days" +"%Y%m%d"`
>
>     # get today's date, and skip LVs that don't need to be checked yet
>     today=`date +"%Y%m%d"`
>     [ "$check_day" -gt "$today" ] && continue
>
>     # else, check it
>     check_fs "$VG" "$LV" "$OPTIONS"
> done
>
>
> #!/bin/sh
>
> # e2check configuration variables:
> #
> # EMAIL
> #    Address to send failure notifications to.  If empty,
> #    failure notifications will not be sent.
> #
> # INTERVAL
> #    Days to wait between checks.  All LVs use the same
> #    INTERVAL, but the "days since last check" value can
> #    be different per LV, since that value is stored in
> #    the ext2/ext3 superblock.
> #
> # DEFAULT_SNAPSIZE
> #    Default snapshot size to use if none is specified
> #    in the options field in /etc/fstab (using the custom
> #    SNAPSIZE=xxx option) for any LV.  Valid values are
> #    anything that the -L option to lvcreate will accept.
> #
> # AC_UNKNOWN
> #    Whether to run the e2fsck checks if the script can't
> #    determine whether the machine is on AC power.  Laptop
> #    users will want to set this to ABORT, while server and
> #    desktop users will probably want to set this to
> #    CONTINUE.  Those are the only two valid values.
>
> EMAIL='root'
> INTERVAL=30
> DEFAULT_SNAPSIZE=100m
> AC_UNKNOWN="ABORT"
>
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

--
http://www.uiuc.edu/~menscher/

From chris.mason at oracle.com  Fri Jan 25 16:16:13 2008
From: chris.mason at oracle.com (Chris Mason)
Date: Fri, 25 Jan 2008 11:16:13 -0500
Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
In-Reply-To: <20080125160931.GC1767@duck.suse.cz>
References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu>
	<20080114131454.37eb7c12@think.oraclecorp.com>
	<20080125160931.GC1767@duck.suse.cz>
Message-ID: <200801251116.13690.chris.mason@oracle.com>

On Friday 25 January 2008, Jan Kara wrote:
> > If ext3's DIO code only touches transactions in get_block, then it can
> > violate data=ordered rules.  Basically the transaction that allocates
> > the blocks might commit before the DIO code gets around to writing them.
> >
> > A crash in the wrong place will expose stale data on disk.
>
> Hmm, I've looked at it and I don't think so - look at the rationale in
> the patch below...  That patch should fix the lock-inversion problem (at
> least I see no lockdep warnings on my test machine).

Ah ok, when I was looking at this I was allowing holes to get filled
without falling back to buffered.  But, with the orphan inode entry
protecting things I see how you're safe with this patch.

-chris

From daviso at gmail.com  Thu Jan 31 22:38:51 2008
From: daviso at gmail.com (Davi Santos Oliveira)
Date: Thu, 31 Jan 2008 20:38:51 -0200
Subject: Ext3 Repair
Message-ID:

Hello,

First, sorry for my English.  I'm new to this list, and I'm having
trouble because of a missing disk in my RAID 5.
The server has an LVM volume on the RAID 5, and the partitions on LVM
are ext3.  I can't identify where the ext3 superblock is on this LVM
partition, so I can't point fsck at it.  I've tried many things:

fsck -b 8192 /dev/VolGroup/LogVol04
dumpe2fs /dev/VolGroup/LogVol04 | grep -i superblock

I also tried testdisk, and none of these solved my problem.  I need to
recover the files from the ext3 partition, or to repair the partition,
which sounds better to me.

Can anyone help me?

[]'s

--
Davi Santos Oliveira

From mb--ext3 at dcs.qmul.ac.uk  Thu Jan 31 16:27:48 2008
From: mb--ext3 at dcs.qmul.ac.uk (Matt Bernstein)
Date: Thu, 31 Jan 2008 16:27:48 +0000 (GMT)
Subject: forced fsck (again?)
In-Reply-To: <20080122225248.GD1659@mit.edu>
References: <200801221701.50202.giancarlo.corti@supsi.ch>
	<4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com>
	<70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
	<20080122225248.GD1659@mit.edu>
Message-ID:

On Jan 22 Theodore Tso wrote:

> #!/bin/sh
> #
> # e2croncheck
>
> VG=closure
> VOLUME=root
> SNAPSIZE=100m
> EMAIL=tytso at mit.edu

[snip]

> Well, this isn't a complete solution, because a lot of people don't
> use LVM

Please forgive my late noticing of this.  The idea is good, and will
work fine in 99% of cases.

I'd love to snapshot (for rsync as well as fsck) my large filesystems,
which have external journals which in turn are in a different VG.  I
suspect that if I were to naively run your script, really interesting
things would be likely to happen ;)

So.. I'd love to atomically make two snapshots, but I guess that is
Hard (or would at least require a very coarse lock).  I suppose in the
meantime I could "tune2fs -O ^has_journal" the snapshot volume, but I'm
too scared even to do that.

So.. maybe I could request that you either include a Big Fat Disclaimer,
or code based on the following (untested, you can probably do better)?

if (tune2fs -l /dev/${VG}/${VOLUME} | egrep -q "Journal device")
then
	echo "Cowardly refusing to play with external journals."
	echo "There be dragons!"
	exit 1
fi
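[Editorial note: the guard above can be made testable by taking the superblock listing on stdin instead of running tune2fs inline. This variant is equally untested against a real external-journal setup; it matches the same "Journal device" line the fragment above greps for.]

```shell
# Sketch of the same guard with the device listing read from stdin, so
# it can be exercised without a real filesystem.  A filesystem with an
# external journal reports a "Journal device:" line in its superblock
# dump; one with an internal journal reports "Journal inode:" instead.
has_external_journal() {
    grep -q '^Journal device:'
}

# usage:
#   if tune2fs -l "/dev/${VG}/${VOLUME}" | has_external_journal ; then
#       echo "Cowardly refusing to play with external journals." >&2
#       exit 1
#   fi
```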