From zach.brown at oracle.com Wed Jan 2 20:42:19 2008 From: zach.brown at oracle.com (Zach Brown) Date: Wed, 02 Jan 2008 12:42:19 -0800 Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66) In-Reply-To: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> Message-ID: <477BF72B.4000608@oracle.com> Erez Zadok wrote: > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree. > Kernel w/ SMP, preemption, and lockdep configured. This is a real lock ordering problem. Thanks for reporting it. The updating of atime inside sys_mmap() orders the mmap_sem in the vfs outside of the journal handle in ext3's inode dirtying: > -> #1 (jbd_handle){--..}: > [] __lock_acquire+0x9cc/0xb95 > [] lock_acquire+0x5f/0x78 > [] journal_start+0xee/0xf8 > [] ext3_journal_start_sb+0x48/0x4a > [] ext3_dirty_inode+0x27/0x6c > [] __mark_inode_dirty+0x29/0x144 > [] touch_atime+0xb7/0xbc > [] generic_file_mmap+0x2d/0x42 > [] mmap_region+0x1e6/0x3b4 > [] do_mmap_pgoff+0x1fb/0x253 > [] sys_mmap2+0x9b/0xb5 > [] syscall_call+0x7/0xb > [] 0xffffffff ext3_direct_IO() orders the journal handle outside of the mmap_sem that dio_get_page() acquires to pin pages with get_user_pages(): > -> #0 (&mm->mmap_sem){----}: > [] __lock_acquire+0x8bc/0xb95 > [] lock_acquire+0x5f/0x78 > [] down_read+0x3a/0x4c > [] dio_get_page+0x4e/0x15d > [] __blockdev_direct_IO+0x431/0xa81 > [] ext3_direct_IO+0x10c/0x1a1 > [] generic_file_direct_IO+0x124/0x139 > [] generic_file_direct_write+0x56/0x11c > [] __generic_file_aio_write_nolock+0x33d/0x489 > [] generic_file_aio_write+0x58/0xb6 > [] ext3_file_write+0x27/0x99 > [] do_sync_write+0xc5/0x102 > [] vfs_write+0x90/0x119 > [] sys_write+0x3d/0x61 > [] sysenter_past_esp+0x5f/0xa5 > [] 0xffffffff Two fixes come to mind: 1) use something like Peter's ->mmap_prepare() to update atime before acquiring the mmap_sem. ( http://lkml.org/lkml/2007/11/11/97 ). 
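[Archive editor's note: the two traces above form the classic AB-BA pattern lockdep exists to catch. As a toy illustration only (user-space Python, not kernel code; the lock names merely mirror the ones in the traces), a lockdep-style ordering checker might look like this:]

```python
# Toy sketch of the AB-BA inversion lockdep reports above. This is
# user-space illustration, not kernel code; "mmap_sem"/"jbd_handle"
# are just labels mirroring the quoted traces.

class OrderChecker:
    """Record which locks were held when each lock was taken; flag
    any pair ever seen in both orders (a potential deadlock)."""
    def __init__(self):
        self.held = []          # locks currently held, in order
        self.before = set()     # (a, b): a was held while taking b

    def acquire(self, lock):
        for h in self.held:
            if (lock, h) in self.before:
                raise RuntimeError(
                    f"possible deadlock: {h} -> {lock} inverts {lock} -> {h}")
            self.before.add((h, lock))
        self.held.append(lock)

    def release(self, lock):
        self.held.remove(lock)

chk = OrderChecker()

# Path 1: sys_mmap() -> touch_atime() -> journal_start()
chk.acquire("mmap_sem"); chk.acquire("jbd_handle")
chk.release("jbd_handle"); chk.release("mmap_sem")

# Path 2: ext3_direct_IO() holds the journal handle, then
# dio_get_page() takes mmap_sem -- the inversion.
chk.acquire("jbd_handle")
try:
    chk.acquire("mmap_sem")
except RuntimeError as e:
    print(e)
```

Path 1 records the ordering mmap_sem -> jbd_handle; path 2 then attempts jbd_handle -> mmap_sem and trips the check, which is essentially the report quoted above.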
I don't know if this would leave more paths which do a journal_start() while holding the mmap_sem. 2) rework ext3's dio to only hold the jbd handle in ext3_get_block(). Chris has a patch for this kicking around somewhere but I'm told it has problems exposing old blocks in ordered data mode. Does anyone have preferences? I could go either way. I certainly don't like the idea of journal handles being held across the entirety of fs/direct-io.c. It's yet another case of O_DIRECT differing wildly from the buffered path :(. - z From fasihullah.askiri at gmail.com Thu Jan 3 10:30:22 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Thu, 3 Jan 2008 16:00:22 +0530 Subject: read() on a deleted file Message-ID: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> Hi all I have a doubt regarding the behaviour of read() on an ext3 filesystem. To elucidate my doubts, I wrote a small program that opens a file, reads one byte at a time, and sleeps for a while between reads. I deleted the file while the read was still in progress and noticed that the read still succeeds. How does this work? Does the kernel not free the inode when the file is deleted but there is a pending read? To check this, instead of deleting, I tried shred-ding the file; the read still gets the correct data. My questions: - Where does the kernel get the data from? - Is this a documented feature which I can use? - Does shred overwrite the file's inode with junk? Thanks for your patience -- Keep Running.... And Relish the run... +Fasih From alex at alex.org.uk Thu Jan 3 10:49:24 2008 From: alex at alex.org.uk (Alex Bligh) Date: Thu, 03 Jan 2008 10:49:24 +0000 Subject: read() on a deleted file In-Reply-To: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> Message-ID: --On 3 January 2008 16:00:22 +0530 Fasihullah Askiri wrote: > I have a doubt regarding the behaviour of read() on an ext3 > filesystem.
To elucidate my doubts, I wrote a small program opens a > file and reads one byte at a time and sleeps for a while. I deleted > the file while the read was still in progress and I noticed that the > read still succeeds. How does this work? Does the kernel not free the > inode when the file is deleted but there is a pending read? To check > this, instead of deleting, I tried shred-ding the file, the read still > gets the correct data. That's standard UNIX behaviour. The file exists on disk until all references to it have disappeared (references including the open file handle). All you do by typing "rm" is delete a reference/link to it from a particular directory, not (necessarily) delete the file. That's why the system call is called "unlink". Alex From fasihullah.askiri at gmail.com Thu Jan 3 11:12:40 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Thu, 3 Jan 2008 16:42:40 +0530 Subject: read() on a deleted file In-Reply-To: References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> Message-ID: <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> Thanx for the response. That is why I tried shred-ding the file. I believe that shred overwrites the file inode, if so, shred should have led to failures of read() which is not the case. How does that happen? On Jan 3, 2008 4:19 PM, Alex Bligh wrote: > > > --On 3 January 2008 16:00:22 +0530 Fasihullah Askiri > wrote: > > > I have a doubt regarding the behaviour of read() on an ext3 > > filesystem. To elucidate my doubts, I wrote a small program opens a > > file and reads one byte at a time and sleeps for a while. I deleted > > the file while the read was still in progress and I noticed that the > > read still succeeds. How does this work? Does the kernel not free the > > inode when the file is deleted but there is a pending read? To check > > this, instead of deleting, I tried shred-ding the file, the read still > > gets the correct data. > > That's standard UNIX behaviour. 
The file exists on disk until all > references to it have disappeared (references including the open > file handle). All you do by typing "rm" is delete a reference/link to > it from a particular directory, not (necessarily) delete the file. > That's why the system call is called "unlink". > > Alex > -- Keep Running.... And Relish the run... +Fasih From liuyue at ncic.ac.cn Thu Jan 3 11:20:37 2008 From: liuyue at ncic.ac.cn (liuyue) Date: Thu, 3 Jan 2008 19:20:37 +0800 Subject: ext3 peformance problem Message-ID: <20080103111526.B22051368B1@ncic.ac.cn> After doing some tests, I think I have found out the reasons. The read/write performance of a hard disk is not homogeneous. The beginning of the disk is laid out on the outer cylinders, farthest from the center, while the end of the disk is on the cylinders closest to the center. Because the rotation speed is constant and the linear information density is constant, the raw performance of the disk is not the same for all cylinders. The degradation can be up to 50% when comparing performance at the beginning of the disk with performance at the end. I created a small partition on the disk (using the first 2000 of the 17000 cylinders on the disk), mounted an ext3 file system on it, and tested the performance of different directories. The performances under different directories are nearly the same :) ======= 2008-01-03 19:06:11 ======= >hello all, > > I am testing ext3 file system recently but find some problem > I use GreatTurbo Enterprise Server 10 (Zuma) and 2.6.20 kernel > I conducted my test as follows: > > mkfs.ext3 /dev/sdb1 > mount /dev/sdb1 /mnt/test > cd /mnt/test > mkdir 0 1 2 3 4 5 6 7 > > I test write and read performance under different subdirs and give the performance result. I also use filefrag to see the file layout.
> >Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/0/tmpfile -c -e -+n -w > 5242880 1024 72706 0 80474 0 > /mnt/test/0/tmpfile: 44 extents found, perfection would be 41 extents > >Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/1/tmpfile -c -e -+n -w > 5242880 1024 49957 0 52899 0 >/mnt/test/1/tmpfile: 42 extents found, perfection would be 41 extents > >Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/2/tmpfile -c -e -+n -w > 5242880 1024 60292 0 64664 0 > /mnt/test/2/tmpfile: 42 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/3/tmpfile -c -e -+n -w > 5242880 1024 70540 0 78644 0 > /mnt/test/3/tmpfile: 46 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/4/tmpfile -c -e -+n -w > 5242880 1024 61334 0 67778 0 > /mnt/test/4/tmpfile: 44 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile -c -e -+n -w > 5242880 1024 66735 0 75114 0 > /mnt/test/5/tmpfile: 42 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/6/tmpfile -c -e -+n -w > 5242880 1024 65062 0 72686 0 > /mnt/test/6/tmpfile: 44 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/7/tmpfile -c -e -+n -w > 5242880 1024 69247 0 78563 0 > /mnt/test/7/tmpfile: 45 extents found, perfection would be 41 extents > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/tmpfile -c -e -+n -w > 5242880 1024 77085 0 81696 0 > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile2 -c -e -+n -w > /mnt/test/5/tmpfile2: 48 extents found, perfection would be 41 extents > 5242880 1024 57776 0 64870 0 > > Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile3 -c -e -+n -w > 5242880 
1024 54799 0 59145 0 > /mnt/test/5/tmpfile3: 44 extents found, perfection would be 41 extents > > My questions are: >1. why the performances under different subdirs varies so much? In /mnt/test/0 the performance is 72/80, while in /mnt/test/1 the performance is 49/53 >2. I see that the extents of all files are nearly the same, but their performances are different. What are the other factors that influence the performance except for the extents(fragmentation) of the file? >3. Is it true that the more files there already exists in a dir, the lower performance we will get if we test under the dir? >as in my test, the performance of /mnt/test/5/tmpfile is 66/75, while the performances of /mnt/test/5/tmpfile2 and tmpfile3 are 57/64 54/59 > >Thanks very much > > > > = = = = = = = = = = = = = = = = = = = = From: liuyue <liuyue at ncic.ac.cn> Date: 2008-01-03 From alex at alex.org.uk Thu Jan 3 11:31:42 2008 From: alex at alex.org.uk (Alex Bligh) Date: Thu, 03 Jan 2008 11:31:42 +0000 Subject: read() on a deleted file In-Reply-To: <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> Message-ID: --On 3 January 2008 16:42:40 +0530 Fasihullah Askiri wrote: > Thanx for the response. That is why I tried shred-ding the file. I > believe that shred overwrites the file inode, if so, shred should have > led to failures of read() which is not the case. How does that happen? Buffering / caching of reads.
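[Archive editor's note: the stale reads Alex alludes to can be reproduced entirely in user space. A minimal sketch, assuming Python, whose buffered file objects behave like stdio's fread() buffering here:]

```python
# Demonstrate stale reads from a user-space read buffer: the first
# 1-byte read pulls a whole buffer's worth of the file into memory,
# so overwriting the file on disk afterwards is invisible to this
# still-buffered reader -- the effect seen with fread() after shred.
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"A" * 100)
os.close(fd)

reader = open(path, "rb")       # buffered, like stdio's fread()
first = reader.read(1)          # fills the buffer with the whole file

with open(path, "r+b") as f:    # "shred" the contents in place
    f.write(b"B" * 100)
    f.flush()
    os.fsync(f.fileno())

rest = reader.read()            # served from the stale buffer, not the disk
reader.close()
os.unlink(path)

print(first, rest[:4])
```

Even though the bytes on disk are now all "B", the buffered reader keeps returning the "A" bytes it read ahead before the overwrite.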
Alex From fasihullah.askiri at gmail.com Thu Jan 3 12:05:03 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Thu, 3 Jan 2008 17:35:03 +0530 Subject: read() on a deleted file In-Reply-To: <1199364407.2930.4.camel@alon> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> <1199364407.2930.4.camel@alon> Message-ID: <80cd17810801030405o333d2bcdkb93d831a1d4640@mail.gmail.com> What I meant was, instead of deleting, I tried shredding the file. The result was still consistent reads. However, after the mail from Alex, I increased the filesize to see how much it caches. Turns out that on my system, the read starts returning junk data [that was written by shred] after reading 1040 bytes correctly. This is what I understand now: if I delete the file, the kernel guarantees that the file data is preserved till the last reference (in the form of an open filehandle, maybe) lingers. If I shred the file, the read succeeds only until the buffered data runs out. This, however, sounds weird to me: what we are essentially saying is that open/read might not return the latest data! AFAIK the buffer cache/inode cache that the kernel maintains is refreshed as soon as the file is modified. Please clarify. Thanks again for the responses. On Jan 3, 2008 6:16 PM, Hayim Shaul wrote: > On Thu, 2008-01-03 at 16:42 +0530, Fasihullah Askiri wrote: > > Thanx for the response. That is why I tried shred-ding the file. I > > believe that shred overwrites the file inode, if so, shred should have > > led to failures of read() which is not the case. How does that happen? > > > > What do you mean by re-writing? > Do you mean opening a new file with the same name and writing into it? > > i don't think the new file (necessarily) gets the same inode as the file > you deleted. > More specifically, while the inode of the "deleted" file still exists, > the new inode would most likely to be different. > > -- Keep Running....
And Relish the run... +Fasih From ling at fnal.gov Thu Jan 3 15:51:07 2008 From: ling at fnal.gov (Ling C. Ho) Date: Thu, 03 Jan 2008 09:51:07 -0600 Subject: ext3 peformance problem In-Reply-To: <20080103111526.B22051368B1@ncic.ac.cn> References: <20080103111526.B22051368B1@ncic.ac.cn> Message-ID: <477D046B.1050400@fnal.gov> I find the "oldalloc" option helpful when doing tests like this, even if you are writing to a single huge directory. Files/dirs will always be written close to each other on the disk physically starting from the beginning of the disk. ... ling liuyue wrote: > After doing some tests, I think I have found out the reasons. > > The read/write performance of an hard disk is not homogenous. The beginning of the disk is stored on the further cylinders from the center, while the end of the disk is stored on the cylinders close to the center. Because the disk rotation speed is constant, and the information density is constant, the raw > performance of the disk is not the same for all cylinders. The performance degradation can be up to 50% when comparing performances at > the beginning of the disk and at the end of the disk. > > I creat a small partition on the disk (using the first 2000 cylinders of the total 17000 cylinders on the disk), mount ext3 file system on it and test the performances of different directories. The performances under different directories are nearly the same :) > > > ======= 2008-01-03 19:06:11 ????????======= > > >> ext3-usershello all, >> >> I am testing ext3 file system recently but find some problem >> I use GreatTurbo Enterprise Server 10 (Zuma) and 2.6.20 kernel >> I conducted my test as follows: >> >> mkfs.ext3 /dev/sdb1 >> mount /dev/sdb1 /mnt/test >> cd /mnt/test >> mkdir 0 1 2 3 4 5 6 7 >> >> I test write and read performance under different subdirs and give the performance result. I also use filefrag to see the file layout. 
>> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/0/tmpfile -c -e -+n -w >> 5242880 1024 72706 0 80474 0 >> /mnt/test/0/tmpfile: 44 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/1/tmpfile -c -e -+n -w >> 5242880 1024 49957 0 52899 0 >> /mnt/test/1/tmpfile: 42 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/2/tmpfile -c -e -+n -w >> 5242880 1024 60292 0 64664 0 >> /mnt/test/2/tmpfile: 42 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/3/tmpfile -c -e -+n -w >> 5242880 1024 70540 0 78644 0 >> /mnt/test/3/tmpfile: 46 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/4/tmpfile -c -e -+n -w >> 5242880 1024 61334 0 67778 0 >> /mnt/test/4/tmpfile: 44 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile -c -e -+n -w >> 5242880 1024 66735 0 75114 0 >> /mnt/test/5/tmpfile: 42 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/6/tmpfile -c -e -+n -w >> 5242880 1024 65062 0 72686 0 >> /mnt/test/6/tmpfile: 44 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/7/tmpfile -c -e -+n -w >> 5242880 1024 69247 0 78563 0 >> /mnt/test/7/tmpfile: 45 extents found, perfection would be 41 extents >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/tmpfile -c -e -+n -w >> 5242880 1024 77085 0 81696 0 >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile2 -c -e -+n -w >> /mnt/test/5/tmpfile2: 48 extents found, perfection would be 41 extents >> 5242880 1024 57776 0 64870 0 >> >> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f 
/mnt/test/5/tmpfile3 -c -e -+n -w >> 5242880 1024 54799 0 59145 0 >> /mnt/test/5/tmpfile3: 44 extents found, perfection would be 41 extents >> >> My questions are: >> 1. why the performances under different subdirs varies so much? In /mnt/test/0 the performance is 72/80, while in /mnt/test/1 the performance is 49/53 >> 2. I see that the extents of all files are nearly the same, but their performances are different. What are the other factors that influence the performance except for the extents(fragmentation) of the file? >> 3. Is it true that the more files there already exists in a dir, the lower performance we will get if we test under the dir? >> as in my test, the performance of /mnt/test/5/tmpfile is 66/75, while the performances of /mnt/test/5/tmpfile2 and tmpfile3 are 57/64 54/59 >> >> Thanks very much >> >> > > = = = = = = = = = = = = = = = = = = = = > From: liuyue <liuyue at ncic.ac.cn> > Date: 2008-01-03 > > _______________________________________________ > Ext3-users mailing list > Ext3-users at redhat.com > https://www.redhat.com/mailman/listinfo/ext3-users > > From davids at webmaster.com Thu Jan 3 20:45:40 2008 From: davids at webmaster.com (David Schwartz) Date: Thu, 3 Jan 2008 12:45:40 -0800 Subject: read() on a deleted file In-Reply-To: <80cd17810801030405o333d2bcdkb93d831a1d4640@mail.gmail.com> Message-ID: > This is what I understand now, if I delete the file, the kernel > guarantees that the file data is preserved till the last reference (in > the form of an open filehandle maybe) lingers. If I shred the file, > the read succeeds till the buffering is done. Actually, you can't delete a file while there are references to it. You can remove it from its directory, which reduces the reference count by one, but that's it. That's why the system call in UNIX is called "unlink" rather than "delete". A file is automatically deleted when its reference count goes to zero.
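[Archive editor's note: the reference-count behaviour David describes is easy to demonstrate from user space. A sketch, assuming Linux and Python's unbuffered os.read(), so stdio-style buffering stays out of the picture:]

```python
# Unlink a file while a descriptor is still open: the directory entry
# goes away immediately, but the inode and its data survive until the
# last reference (the open fd) is dropped.
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"still here")
os.close(fd)

fd = os.open(path, os.O_RDONLY)   # take a reference via an open fd
os.unlink(path)                   # remove the only directory link

assert not os.path.exists(path)   # the name is gone...
data = os.read(fd, 100)           # ...yet the data is still readable
os.close(fd)                      # last reference dropped: inode freed

print(data)
```

The os.read() succeeds and returns the original contents even though the pathname no longer resolves.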
Putting a file in a directory adds one to its reference count. Opening a file adds one. > This, however sounds wierd to me, what we are essentially saying is > that the open/read might not return the latest data!!!! AFAIK the > buffer cache/inode cache that the kernel maintains is refreshed as > soon the file is modified. Please clarify. It's impossible to clarify unless you tell us more precisely what you are doing. For example, you use the term "shred", but that can mean way more than one thing. Also, when you talk about "reading" a file, that could mean the "read" system call, but it could also mean the "fread" library function. DS From fasihullah.askiri at gmail.com Fri Jan 4 06:08:53 2008 From: fasihullah.askiri at gmail.com (Fasihullah Askiri) Date: Fri, 4 Jan 2008 11:38:53 +0530 Subject: read() on a deleted file In-Reply-To: References: <80cd17810801030405o333d2bcdkb93d831a1d4640@mail.gmail.com> Message-ID: <80cd17810801032208g78509444q9d37162361c68156@mail.gmail.com> Hi Sorry for the confusion caused. I just realized that I was using fread and not read. By "shred", I meant the /usr/bin/shred program which overwrites a file with junk. I was getting stale results because of the buffering in fread. Thanks again for the responses. On Jan 4, 2008 2:15 AM, David Schwartz wrote: > > This is what I understand now, if I delete the file, the kernel > > guarantees that the file data is preserved till the last reference (in > > the form of an open filehandle maybe) lingers. If I shred the file, > > the read succeeds till the buffering is done. > > Actually, you can't delete a file while there are references to it. You can > remove it from its directory, which reduces the reference count by one, but > that's it. That's why the system call in UNIX is called "unlink" rather than > "delete". > > A file is automatically deleted when its reference count goes to zero. > Putting a file in a directory adds one to its reference count. Opening a > file adds one.
> > > This, however sounds wierd to me, what we are essentially saying is > > that the open/read might not return the latest data!!!! AFAIK the > > buffer cache/inode cache that the kernel maintains is refreshed as > > soon the file is modified. Please clarify. > > It's impossible to clarify unless you tell us more precisely what you are > doing. For example, you use the term "shred", but that can mean way more > than one thing. Also, when you talk about "reading" a file, that could mean > the "read" system call, but it could also mean the "fread" library function. > > DS > > > -- Keep Running.... And Relish the run... +Fasih From evoltech at 2inches.com Sat Jan 5 05:28:33 2008 From: evoltech at 2inches.com (Dennison Williams) Date: Fri, 04 Jan 2008 21:28:33 -0800 Subject: ext3 filesystem is not recognized Message-ID: <477F1581.6070003@2inches.com> Hello all, I have a few ext3 file systems that are not being recognized. Here is the setup: MD software RAID 5 on 4 disks (md0), an LVM logical volume (/dev/volume_group/logical_volume) comprised of one physical device (/dev/md0), an encryption layer provided by the cryptoloop driver (losetup -e aes /dev/loop0 /dev/volume_group/logical_volume), then an ext3 file system (mkfs.ext3 /dev/loop0). Recently the RAID device kicked out one of the disks during a large file transfer. After re-adding the disk to the array (smartctl didn't report anything wrong with it, so I am not sure why this happened), authenticating against the cryptographic layer, and then trying to mount the drive, I get the following error: [root at storage redhat]# mount -t ext3 /dev/loop1 /terrorbyte/1/ mount: wrong fs type, bad option, bad superblock on /dev/loop1, The message in /var/log/messages is: VFS: Can't find ext3 filesystem on dev loop1. I then tried running e2fsck on /dev/loop1 with each of the backup superblock locations reported by "mke2fs -n /dev/loop1", still with no luck.
I am unsure of where the problem actually is, and how to go about debugging it. Any suggestions would be appreciated. Sincerely, Dennison Williams -- ***************************************************************** * To communicate with me securely, please email me and I will * * send you my public key. We can then verify each other's * * fingerprints in person, or over the phone. * * * * I am open and willing to talk about setting up PGP, the * * security problems inherent with PGP, and alternatives to PGP * * for secure electronic communication. * ***************************************************************** From darkonc at gmail.com Sat Jan 5 07:06:19 2008 From: darkonc at gmail.com (Stephen Samuel) Date: Fri, 4 Jan 2008 23:06:19 -0800 Subject: ext3 filesystem is not recognized In-Reply-To: <477F21BA.3040007@2inches.com> References: <477F1581.6070003@2inches.com> <6cd50f9f0801042201r16977affw5ec3804ea580c8e2@mail.gmail.com> <477F21BA.3040007@2inches.com> Message-ID: <6cd50f9f0801042306h5a800d62h76a33dc8a66f1678@mail.gmail.com> Given what you've described, the only drive that it would make sense to pull out would be the one that was dropped and then re-inserted. On Jan 4, 2008 10:20 PM, Dennison Williams wrote: > > Did you try and re-insert the kicked-out drive as if it was clean, or > did > > you try to re-sync it to the existing filesystem? If the former, then > > that's a HUGE mistake because the data on the drive is no longer in sync > > with what is on the other drives. (unless the entire filesystem was made > > read-only when (or before) the drive was dropped out.) > > I re-inserted it with: > mdadm /dev/md0 --add /dev/sde > At which point it seemed to resync with the raid device (ie. the output > of /proc/mdstat showed that it was incrementally syncing) > > > Check the SMART logs for each of the drives to see if they've had any > > problems.
> > there are messages like this: > /dev/sdc, failed to read SMART Attribute Data > ...but this wasn't one of the disks that was removed from the raid device If there are complaints about sdc, then I'd be inclined to do a long SMART self-test of it. It's possible that the real problem started here. A badblock read test (or just a dd if=/dev/sdc of=/dev/null) would also test the I/O path between the drive and the CPU. If there are complaints about that drive, then, at this point, you should consider it suspicious. > > Try pulling the (candidate) compromised drive out of the array and see if > the (degraded) filesystem works OK and has good data. If it does, then I'd > guess that the pulled drive had bad data written to it somehow --- re-add it > (as if it was hot-swapped in), and hope it doesn't happen again. > Try that with each of the drives, in turn until you find the badly written > drive. If one of the drives has badly written data, the system really can't > tell, for sure, which one is wrong. > I want to make sure I understand you here. Say my raid device is > comprised of four devices /dev/md0 = /dev/sd[abcd], are you suggesting > that for each drive I do something like this: > > mdadm /dev/md0 --fail /dev/sda --remove /dev/sda Don't bother. If the drive got resynced, then pulling it won't do any good unless software RAID gets silently confused by random data on one plex. > > > then try to mount up the FS as usual to see if it is there? Wouldn't > this point be moot if the device already re-assembled itself? > Yes, it would be moot. > > > > > [[ unless the array was read-only when the drive was dropped, then you > will > only have any hope of good data with the dropped drive pulled ]] > > It wasn't read-only, but nothing was writing to it. > > Thanks for your time and prompt response. > Sincerely, > Dennison Williams > Unless noatime was set, the drive was being written to (if only with atime data).
If all that got scrambled was atime data, you should still have been able to mount the drive. -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From evoltech at 2inches.com Sun Jan 6 08:15:26 2008 From: evoltech at 2inches.com (Dennison Williams) Date: Sun, 06 Jan 2008 00:15:26 -0800 Subject: ext3 filesystem is not recognized In-Reply-To: <6cd50f9f0801042306h5a800d62h76a33dc8a66f1678@mail.gmail.com> References: <477F1581.6070003@2inches.com> <6cd50f9f0801042201r16977affw5ec3804ea580c8e2@mail.gmail.com> <477F21BA.3040007@2inches.com> <6cd50f9f0801042306h5a800d62h76a33dc8a66f1678@mail.gmail.com> Message-ID: <47808E1E.1030406@2inches.com> Stephen Samuel wrote: > Given what you've described, the only drive that it would make sense to > pull out would be the one that was dropped and then re-inserted. I did this with the following set of commands: mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sdf /dev/sdg /dev/sdh mdadm --run /dev/md1 lvchange -a y /dev/volume_group/logical_volume losetup -e aes /dev/loop1 /dev/volume_group/logical_volume mount -t ext3 -o ro /dev/loop1 /mnt/logical_volume and got the same error: "mount: wrong fs type, bad option, bad superblock on /dev/loop1" >>> Check the SMART logs for each of the drives to see if they've had any >>> problems. >> there are messages like this: >> /dev/sdc, failed to read SMART Attribute Data >> ...but this wasn't one of the disks that was removed from the raid device > > If there are complaints about SDC, then I'd be inclined to do a long test of > it > in smart. it's possible that the real problem started here. > > A badblock read test (or just a dd if=/dev/sdc of=/dev/null) would also test > the I/O path between the drive and the CPU. If there are complaints about > that drive, then .. at this point, you should consider it suspicious. Ran "dd if=/dev/sdc of=/dev/null" while monitoring /var/log/messages, with no messages.
Must have been a fluke. I will try doing an extended run of smartctl. >> Try pulling the (candidate) compromised drive out of the array and see if >> the (degraded) filesystem works OK and has good data. If it does, then > I'd >> guess that the pulled drive had bad data written to it somehow --- re-add > it >> (as if it was hot-swapped in), and hope it doesn't happen again. >> Try that with each of the drives, in turn until you find the badly > written >> drive. If one of the drives has badly written data, the system really > can't >> tell, for sure, which one is wrong. I ended up doing this with each drive as above and still the FS wasn't recognized. One thing that confuses me though is that the data seems to be partially valid. When the array device is assembled and running, the logical volume is recognized, and furthermore losetup accepts the correct password. The only thing that doesn't seem to be in working order is the ext3 filesystem. In the Linux encryption HOWTO (http://encryptionhowto.sourceforge.net/Encryption-HOWTO-6.html, section 6.1), there is an entry describing possible problems if the kernel was compiled without CONFIG_BLK_LOOP_DEV_USE_REL_BLOCK. I can't find this option anywhere in the config for my kernel (2.6.18-1.2798.fc6xen). At this point I am thinking that the problem is at the cryptoloop or ext3 level, but I am not sure what else I can do to check. Any more ideas? Sincerely, Dennison Williams From liuyue at ncic.ac.cn Mon Jan 7 02:55:27 2008 From: liuyue at ncic.ac.cn (liuyue) Date: Mon, 7 Jan 2008 10:55:27 +0800 Subject: ext3 peformance problem Message-ID: <20080107024936.85CB9136935@ncic.ac.cn> It does help !! ======= 2008-01-03 23:51:07 ======= >I find the "oldalloc" option helpful when doing tests like this, even if >you are writing to a single huge directory. Files/dirs will always be >written close to each other on the disk physically starting from the >beginning of the disk. > >...
>ling > >liuyue wrote: >> After doing some tests, I think I have found out the reasons. >> >> The read/write performance of an hard disk is not homogenous. The beginning of the disk is stored on the further cylinders from the center, while the end of the disk is stored on the cylinders close to the center. Because the disk rotation speed is constant, and the information density is constant, the raw >> performance of the disk is not the same for all cylinders. The performance degradation can be up to 50% when comparing performances at >> the beginning of the disk and at the end of the disk. >> >> I creat a small partition on the disk (using the first 2000 cylinders of the total 17000 cylinders on the disk), mount ext3 file system on it and test the performances of different directories. The performances under different directories are nearly the same :) >> >> >> ======= 2008-01-03 19:06:11 ======= >> >> >>> hello all, >>> >>> I am testing ext3 file system recently but find some problem >>> I use GreatTurbo Enterprise Server 10 (Zuma) and 2.6.20 kernel >>> I conducted my test as follows: >>> >>> mkfs.ext3 /dev/sdb1 >>> mount /dev/sdb1 /mnt/test >>> cd /mnt/test >>> mkdir 0 1 2 3 4 5 6 7 >>> >>> I test write and read performance under different subdirs and give the performance result. I also use filefrag to see the file layout.
>>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/0/tmpfile -c -e -+n -w >>> 5242880 1024 72706 0 80474 0 >>> /mnt/test/0/tmpfile: 44 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/1/tmpfile -c -e -+n -w >>> 5242880 1024 49957 0 52899 0 >>> /mnt/test/1/tmpfile: 42 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/2/tmpfile -c -e -+n -w >>> 5242880 1024 60292 0 64664 0 >>> /mnt/test/2/tmpfile: 42 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/3/tmpfile -c -e -+n -w >>> 5242880 1024 70540 0 78644 0 >>> /mnt/test/3/tmpfile: 46 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/4/tmpfile -c -e -+n -w >>> 5242880 1024 61334 0 67778 0 >>> /mnt/test/4/tmpfile: 44 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile -c -e -+n -w >>> 5242880 1024 66735 0 75114 0 >>> /mnt/test/5/tmpfile: 42 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/6/tmpfile -c -e -+n -w >>> 5242880 1024 65062 0 72686 0 >>> /mnt/test/6/tmpfile: 44 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/7/tmpfile -c -e -+n -w >>> 5242880 1024 69247 0 78563 0 >>> /mnt/test/7/tmpfile: 45 extents found, perfection would be 41 extents >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/tmpfile -c -e -+n -w >>> 5242880 1024 77085 0 81696 0 >>> >>> Command line used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile2 -c -e -+n -w >>> /mnt/test/5/tmpfile2: 48 extents found, perfection would be 41 extents >>> 5242880 1024 57776 0 64870 0 >>> >>> Command line 
used: /root/iozone -s 5g -r 1m -i 0 -i 1 -f /mnt/test/5/tmpfile3 -c -e -+n -w >>> 5242880 1024 54799 0 59145 0 >>> /mnt/test/5/tmpfile3: 44 extents found, perfection would be 41 extents >>> >>> My questions are: >>> 1. Why does the performance under different subdirs vary so much? In /mnt/test/0 the performance is 72/80, while in /mnt/test/1 the performance is 49/53 >>> 2. I see that the extents of all files are nearly the same, but their performances are different. What are the other factors that influence the performance besides the extents (fragmentation) of the file? >>> 3. Is it true that the more files already exist in a dir, the lower the performance we will get if we test under that dir? >>> As in my test, the performance of /mnt/test/5/tmpfile is 66/75, while the performances of /mnt/test/5/tmpfile2 and tmpfile3 are 57/64 and 54/59 >>> >>> Thanks very much >>> >> >> = = = = = = = = = = = = = = = = = = = = >> >> liuyue >> liuyue at ncic.ac.cn >> 2008-01-03 >> >> _______________________________________________ >> Ext3-users mailing list >> Ext3-users at redhat.com >> https://www.redhat.com/mailman/listinfo/ext3-users >> >> = = = = = = = = = = = = = = = = = = = = liuyue liuyue at ncic.ac.cn 2008-01-07 From lakshmipathi.g at gmail.com Mon Jan 7 04:53:46 2008 From: lakshmipathi.g at gmail.com (lakshmi pathi) Date: Mon, 7 Jan 2008 10:23:46 +0530 Subject: How to flush file system buffers? Message-ID: Hi all, I want to know whether there is any system call available to flush all ext3 file system buffers (especially the inode cache) to disk. I tried sync(), but it doesn't seem to work for me. Any thoughts? -Laks -------------- next part -------------- An HTML attachment was scrubbed...
URL: From hayim at iportent.com Thu Jan 3 12:46:47 2008 From: hayim at iportent.com (Hayim Shaul) Date: Thu, 03 Jan 2008 14:46:47 +0200 Subject: read() on a deleted file In-Reply-To: <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> References: <80cd17810801030230s6adb4e38w311decb927268780@mail.gmail.com> <80cd17810801030312y2451b001o1b0948a5be19d8dc@mail.gmail.com> Message-ID: <1199364407.2930.4.camel@alon> On Thu, 2008-01-03 at 16:42 +0530, Fasihullah Askiri wrote: > Thanx for the response. That is why I tried shred-ding the file. I > believe that shred overwrites the file inode, if so, shred should have > led to failures of read() which is not the case. How does that happen? > What do you mean by re-writing? Do you mean opening a new file with the same name and writing into it? I don't think the new file (necessarily) gets the same inode as the file you deleted. More specifically, while the inode of the "deleted" file still exists, the new inode would most likely be different.
> > > > What do you mean by re-writing? > Do you mean opening a new file with the same name and writing into it? > > i don't think the new file (necessarily) gets the same inode as the file > you deleted. > More specifically, while the inode of the "deleted" file still exists, > the new inode would most likely to be different. > > -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From amillionlobsters at gmail.com Fri Jan 11 04:32:37 2008 From: amillionlobsters at gmail.com (Paul d'Aoust) Date: Thu, 10 Jan 2008 20:32:37 -0800 Subject: root inode corrupted; tries to clear and reallocate, but can't Message-ID: <78dcc8220801102032x2ea66985x6ab43286f824d3b5@mail.gmail.com> Hi there. I think I fscked up my filesystem (betcha nobody's used that one before!). I made the mistake of fscking an online ext3 filesystem (guess I wasn't paying attention or I was sick of it being so paranoid or something) and quickly discovered why I'm not supposed to do that. The root inode somehow got corrupted, and a whole bunch of inodes started claiming the same blocks. Here's the result of my attempt to mount: mount: wrong fs type, bad option, bad superblock on /dev/hda1, missing codepage or helper program, or other error. In some cases useful info is found in syslog -- try dmesg | tail or so So 'dmesg' reveals this: EXT3-fs: corrupt root inode, run e2fsck Then, when I run e2fsck, the first thing it says is Root inode is not a directory. Clear? I say 'yes', and then it proceeds to correct and then delete the parent entry for every inode in the root directory (owing to the fact that their parent, inode 2, has just been cleared). Here's the exact wording: Missing '..' in directory inode 5406734. Fix? yes Entry '..' in ... (5406734) has deleted/unused inode 2. Clear? yes Then, in pass 3, when it tries to repair the root inode, it says Root inode not allocated. Allocate? 
yes Error creating root directory (extfs_new_block): Could not allocate block in ext2 filesystem e2fsck: aborted Now, I know I have more than just a couple free blocks, partly because debugfs says so, and partly because I've tried deleting inodes and freeing up blocks. Some I deleted when e2fsck asked me if I wanted to clone or delete the multiply-claimed blocks, and some I deleted by using 'clri' in debugfs. I've tried unallocating the root inode and its block manually, and it still says it can't allocate any block in the filesystem when it tries to rebuild the root inode. If anybody has some insight or suggestions, I would love to hear them! Thanks in advance, Paul d'Aoust From jss at ast.cam.ac.uk Fri Jan 11 11:54:46 2008 From: jss at ast.cam.ac.uk (Jeremy Sanders) Date: Fri, 11 Jan 2008 11:54:46 +0000 Subject: Checksumming layer Message-ID: Is there any sort of checksumming layer that could lie between the disk and ext3, or be implemented as part of ext3/4? We've just had a couple of drives recently where the drive started silently corrupting the data without generating any I/O or SMART errors. This is pretty disastrous as you don't necessarily find out about the corruption until it is too late. I imagine the overhead of such a layer wouldn't be that much. I would pay a few percent performance for knowing that the data is not corrupt. Jeremy -- Jeremy Sanders http://www-xray.ast.cam.ac.uk/~jss/ X-Ray Group, Institute of Astronomy, University of Cambridge, UK. Public Key Server PGP Key ID: E1AAE053 From lists at nerdbynature.de Fri Jan 11 12:26:40 2008 From: lists at nerdbynature.de (Christian Kujau) Date: Fri, 11 Jan 2008 13:26:40 +0100 (CET) Subject: Checksumming layer In-Reply-To: References: Message-ID: <50295.62.180.231.196.1200054400.squirrel@housecafe.dyndns.org> On Fri, January 11, 2008 12:54, Jeremy Sanders wrote: > Is there any sort of checksumming layer that could lie between the disk > and ext3, or be implemented as part of ext3/4?
http://www.bullopensource.org/ext4/files/ext4.txt notes: * journal checksumming for robustness, performance (prototype exists) Features like metadata checksumming have been discussed and planned for a bit, but no patches exist yet, so I'm not sure they're in the near-term roadmap. ...but apart from that, only ZFS comes to mind: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Checksums C. -- make bzImage, not war From jprats at cesca.es Fri Jan 11 12:38:21 2008 From: jprats at cesca.es (Jordi Prats) Date: Fri, 11 Jan 2008 13:38:21 +0100 Subject: Checksumming layer In-Reply-To: References: Message-ID: <4787633D.5070905@cesca.es> Hi, You could use tripwire to check all files periodically instead of relying on the file system for that task. (I don't think any file system does this checking right now) Jordi Jeremy Sanders wrote: > Is there any sort of checksumming layer that could lie between the disk and > ext3, or be implemented as part of ext3/4? > > We've just had a couple of drives recently where the drive started silently > corrupting the data without generating any I/O or SMART errors. This is > pretty disastrous as you don't necessarily find out about the corruption > until it is too late. > > I imagine the overhead of such a layer wouldn't be that much. I would pay a > few percent performance for knowing that the data is not corrupt. > > Jeremy > > -- ...................................................................... __ / / Jordi Prats C E / S / C A Dept. de Sistemes /_/ Centre de Supercomputació de Catalunya Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona T. 93 205 6464 · F. 93 205 6979 · jprats at cesca.es ......................................................................
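Jordi's suggestion (a periodic user-space scan, with no checksumming in the filesystem itself) can be prototyped with nothing more than coreutils. The sketch below is illustrative only: the directory, file names, and messages are made up, and a real deployment would keep the baseline on separate, trusted storage.

```shell
#!/bin/sh
# Record a baseline of per-file SHA-256 checksums, then re-verify it later
# (e.g. from cron) to catch silent corruption. Paths here are illustrative.
set -e
DATA=$(mktemp -d)              # stand-in for the directory being protected
echo "important bits" > "$DATA/file1"

# 1) record the baseline (stored outside the tree it describes)
( cd "$DATA" && find . -type f -print0 | xargs -0 sha256sum ) > "$DATA.baseline"

# 2) later: sha256sum -c exits non-zero if any file no longer matches
if ( cd "$DATA" && sha256sum --quiet -c "$DATA.baseline" ); then
    echo "no silent corruption detected"
else
    echo "checksum mismatch: possible silent corruption"
fi
```

Unlike a checksumming block layer, this only catches corruption between scans, and it cannot tell bit rot apart from a legitimate modification, which is the same limitation the tripwire approach has.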
From jss at ast.cam.ac.uk Fri Jan 11 12:44:31 2008 From: jss at ast.cam.ac.uk (Jeremy Sanders) Date: Fri, 11 Jan 2008 12:44:31 +0000 Subject: Checksumming layer References: <4787633D.5070905@cesca.es> Message-ID: Jordi Prats wrote: > You could use tripwire to check periodically all files instead of relay > on the file system for that task. (I think no file system does this > checking by now) That's a possible idea. I would have thought it would be relatively simple to write a block device which acted as a layer between the file system and the real block device. I suppose the difficulty is getting all the corner cases correct. I've never written any kernel code, so maybe I should investigate doing that for fun... Jeremy -- Jeremy Sanders http://www-xray.ast.cam.ac.uk/~jss/ X-Ray Group, Institute of Astronomy, University of Cambridge, UK. Public Key Server PGP Key ID: E1AAE053 From tweeks at rackspace.com Fri Jan 11 19:55:46 2008 From: tweeks at rackspace.com (tweeks) Date: Fri, 11 Jan 2008 13:55:46 -0600 Subject: Checksumming layer In-Reply-To: References: <4787633D.5070905@cesca.es> Message-ID: <200801111355.47649.tweeks@rackspace.com> On Friday 11 January 2008 06:44, Jeremy Sanders wrote: > Jordi Prats wrote: > > You could use tripwire to check periodically all files instead of relay > > on the file system for that task. (I think no file system does this > > checking by now) > > That's a possible idea. > > I would have thought it would be relatively simple to write a block device > which acted as a layer between the file system and the real block device. I > suppose the difficulty is getting all the corner cases correct. I've never > written any kernel code, so maybe I should investigate doing that for > fun... All files in the system are already hashed. You can see this by doing an "rpm -Va". For example.. to create a baseline of a system to compare against, just cron a script to: rpm -Va > /root/RPMV/system-rpm-baseline.txt then once/day or whatever, do a diff...
or just grep for any "bin" directory changes and diff that. I like this better than messing with tripwire. It's already there, native, and easy to use. Tweeks Confidentiality Notice: This e-mail message (including any attached or embedded documents) is intended for the exclusive and confidential use of the individual or entity to which this message is addressed, and unless otherwise expressly indicated, is confidential and privileged information of Rackspace Managed Hosting. Any dissemination, distribution or copying of the enclosed material is prohibited. If you receive this transmission in error, please notify us immediately by e-mail at abuse at rackspace.com, and delete the original message. Your cooperation is appreciated. From forest at alittletooquiet.net Fri Jan 11 20:13:11 2008 From: forest at alittletooquiet.net (Forest Bond) Date: Fri, 11 Jan 2008 15:13:11 -0500 Subject: Checksumming layer In-Reply-To: <200801111355.47649.tweeks@rackspace.com> References: <4787633D.5070905@cesca.es> <200801111355.47649.tweeks@rackspace.com> Message-ID: <20080111201311.GC21140@storm.local.network> Hi, On Fri, Jan 11, 2008 at 01:55:46PM -0600, tweeks wrote: > On Friday 11 January 2008 06:44, Jeremy Sanders wrote: > > Jordi Prats wrote: > > > You could use tripwire to check periodically all files instead of relay > > > on the file system for that task. (I think no file system does this > > > checking by now) > > > > That's a possible idea. > > > > I would have thought it would be relatively simple to write a block device > > which acted a layer between the file system and real block device. I > > suppose the difficultly is getting all the corner cases correct. I've never > > written any kernel code, so maybe I should investigate doing that for > > fun... > > All files in the system are already hashed. You can see this by doing > an "rpm -Va". For example.. 
to create a baseline of a system to compare > against, just cron a script to: > rpm -Va > /root/RPMV/system-rpm-baseline.txt > > then once/day or whatever, do a diff... or just grep for any "bin" directory > changes and diff that. I like this better than messing with tripwire. It's > already there, native, and easy to use. This is specific to: * RPM-based systems * files provided by RPMs Consequently, it's only useful on certain systems, and, even then, only with certain files. That's not very good coverage, is it? This is especially true when you consider that the files that came from the package manager are usually the ones that you don't care about as much when you've lost data. -Forest -- Forest Bond http://www.alittletooquiet.net -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: Digital signature URL: From adilger at sun.com Fri Jan 11 22:09:20 2008 From: adilger at sun.com (Andreas Dilger) Date: Fri, 11 Jan 2008 15:09:20 -0700 Subject: Checksumming layer In-Reply-To: References: <4787633D.5070905@cesca.es> Message-ID: <20080111220920.GU3351@webber.adilger.int> On Jan 11, 2008 12:44 +0000, Jeremy Sanders wrote: > I would have thought it would be relatively simple to write a block device > which acted a layer between the file system and real block device. I > suppose the difficultly is getting all the corner cases correct. I've never > written any kernel code, so maybe I should investigate doing that for > fun... I think at one point there was a checksumming loop driver, and adding a checksumming mechanism to DM wouldn't be so hard either. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. 
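The cron'd "rpm -Va" baseline check that tweeks describes above might look roughly like the following. The storage location, the grep pattern for "bin" directories, and the behaviour on non-RPM systems are illustrative assumptions, not anything posted in the thread.

```shell
#!/bin/sh
# Keep a baseline of "rpm -Va" output and diff new runs against it,
# flagging changes under bin/sbin/lib paths. Paths are illustrative.
BASE=${BASE:-$(mktemp -d)}
BASELINE="$BASE/system-rpm-baseline.txt"

if ! command -v rpm >/dev/null 2>&1; then
    echo "rpm not available; nothing to verify"
else
    # rpm -Va exits non-zero whenever any file fails verification
    rpm -Va > "$BASE/current.txt" 2>/dev/null || true
    if [ -f "$BASELINE" ]; then
        diff "$BASELINE" "$BASE/current.txt" | grep -E '/(s?bin|lib)/' \
            || echo "no new system-file changes"
    else
        mv "$BASE/current.txt" "$BASELINE"
        echo "baseline created"
    fi
fi
```

As Forest points out, this only covers files that came from packages, so it is a compromise-detection aid rather than a data-integrity check.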
From tweeks at rackspace.com Fri Jan 11 22:52:36 2008 From: tweeks at rackspace.com (tweeks) Date: Fri, 11 Jan 2008 16:52:36 -0600 Subject: Checksumming layer In-Reply-To: <20080111201311.GC21140@storm.local.network> References: <200801111355.47649.tweeks@rackspace.com> <20080111201311.GC21140@storm.local.network> Message-ID: <200801111652.37607.tweeks@rackspace.com> On Friday 11 January 2008 14:13, Forest Bond wrote: > Hi, > > On Fri, Jan 11, 2008 at 01:55:46PM -0600, tweeks wrote: > > On Friday 11 January 2008 06:44, Jeremy Sanders wrote: > > > Jordi Prats wrote: > > > > You could use tripwire to check periodically all files instead of > > > > relay on the file system for that task. (I think no file system does > > > > this checking by now) > > > > > > That's a possible idea. > > > > > > I would have thought it would be relatively simple to write a block > > > device which acted a layer between the file system and real block > > > device. I suppose the difficultly is getting all the corner cases > > > correct. I've never written any kernel code, so maybe I should > > > investigate doing that for fun... > > > > All files in the system are already hashed. You can see this by doing > > an "rpm -Va". For example.. to create a baseline of a system to compare > > against, just cron a script to: > > rpm -Va > /root/RPMV/system-rpm-baseline.txt > > > > then once/day or whatever, do a diff... or just grep for any "bin" > > directory changes and diff that. I like this better than messing with > > tripwire. It's already there, native, and easy to use. > > This is specific to: > > * RPM-based systems > * files provided by RPMs > Consequently, it's only useful on certain systems, Heh.. well.. last I checked, this is a redhat ext3 list. Red hat uses rpm.. and no one but Red hat still actually uses ext3 right? (hehe)... > and, even then, only > with certain files. That's not very good coverage, is it? Uhh.. all SYSTEM files.. 
which is all I'm looking at when doing compromise checks (except for root kits, etc.. for which I use separate tools). > This is especially true when you consider that the files that came from the > package manager are usually the ones that you don't care about as much when > you've lost data. You tripwire-scan data files? Hmm.. I've seen hundreds of compromised servers... 80-90% of them can be detected with a simple RPM scan. The ones you can't are the ones where hackers have deleted the RPM DBs, but in that case, your baseline diff sets off red flags anyway. It's actually a pretty good scan to run nightly/weekly, etc (along with root kit scans, etc). In fact.. I prefer using unorthodox detection methods rather than well-known forms of F.A.M. (file alteration monitoring) like tripwire, which, if seen, are instantly attacked and disabled. Tweeks From jack at suse.cz Mon Jan 14 17:06:09 2008 From: jack at suse.cz (Jan Kara) Date: Mon, 14 Jan 2008 18:06:09 +0100 Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66) In-Reply-To: <477BF72B.4000608@oracle.com> References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> <477BF72B.4000608@oracle.com> Message-ID: <20080114170609.GH4214@duck.suse.cz> On Wed 02-01-08 12:42:19, Zach Brown wrote: > Erez Zadok wrote: > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree.
> > Kernel w/ SMP, preemption, and lockdep configured. > > This is a real lock ordering problem. Thanks for reporting it. > > The updating of atime inside sys_mmap() orders the mmap_sem in the vfs > outside of the journal handle in ext3's inode dirtying: > > > -> #1 (jbd_handle){--..}: > > [] __lock_acquire+0x9cc/0xb95 > > [] lock_acquire+0x5f/0x78 > > [] journal_start+0xee/0xf8 > > [] ext3_journal_start_sb+0x48/0x4a > > [] ext3_dirty_inode+0x27/0x6c > > [] __mark_inode_dirty+0x29/0x144 > > [] touch_atime+0xb7/0xbc > > [] generic_file_mmap+0x2d/0x42 > > [] mmap_region+0x1e6/0x3b4 > > [] do_mmap_pgoff+0x1fb/0x253 > > [] sys_mmap2+0x9b/0xb5 > > [] syscall_call+0x7/0xb > > [] 0xffffffff > > ext3_direct_IO() orders the journal handle outside of the mmap_sem that > dio_get_page() acquires to pin pages with get_user_pages(): > > > -> #0 (&mm->mmap_sem){----}: > > [] __lock_acquire+0x8bc/0xb95 > > [] lock_acquire+0x5f/0x78 > > [] down_read+0x3a/0x4c > > [] dio_get_page+0x4e/0x15d > > [] __blockdev_direct_IO+0x431/0xa81 > > [] ext3_direct_IO+0x10c/0x1a1 > > [] generic_file_direct_IO+0x124/0x139 > > [] generic_file_direct_write+0x56/0x11c > > [] __generic_file_aio_write_nolock+0x33d/0x489 > > [] generic_file_aio_write+0x58/0xb6 > > [] ext3_file_write+0x27/0x99 > > [] do_sync_write+0xc5/0x102 > > [] vfs_write+0x90/0x119 > > [] sys_write+0x3d/0x61 > > [] sysenter_past_esp+0x5f/0xa5 > > [] 0xffffffff > > Two fixes come to mind: > > 1) use something like Peter's ->mmap_prepare() to update atime before > acquiring the mmap_sem. ( http://lkml.org/lkml/2007/11/11/97 ). I > don't know if this would leave more paths which do a journal_start() > while holding the mmap_sem. > > 2) rework ext3's dio to only hold the jbd handle in ext3_get_block(). > Chris has a patch for this kicking around somewhere but I'm told it has > problems exposing old blocks in ordered data mode. > > Does anyone have preferences? I could go either way. 
I certainly don't > like the idea of journal handles being held across the entirety of > fs/direct-io.c. It's yet another case of O_DIRECT differing wildly from > the buffered path :(. I've looked more into it and I think that 2) is the only way to go since transaction start ranks below page lock (standard buffered write path) and page lock ranks below mmap_sem. So we have at least one more dependency: mmap_sem must go before transaction start... Honza -- Jan Kara SUSE Labs, CR From giancarlo.corti at supsi.ch Tue Jan 22 16:01:50 2008 From: giancarlo.corti at supsi.ch (giancarlo corti) Date: Tue, 22 Jan 2008 17:01:50 +0100 Subject: forced fsck (again?) Message-ID: <200801221701.50202.giancarlo.corti@supsi.ch> hello everyone. i guess this has been asked before, but haven't found it in the faq. i have the following issue... it is not uncommon nowadays to have desktops with filesystems in the order of 500gb/1tb. now, my kubuntu (but other distros do the same) forces a fsck on ext3 every so often, no matter what. in the past it wasn't a big issue. but with sizes increasing so much, users are now forced to wait for several minutes (every so often) for their desktops to boot up. to the point that the thing has become unacceptable. i know i can tune/disable this, but i'd like to understand once and for all what is the technical rationale behind this practice and what use is there to force a fsck on a clean fs... i must be missing something... :-( thanks in advance. cheers. From lm at bitmover.com Tue Jan 22 16:08:59 2008 From: lm at bitmover.com (Larry McVoy) Date: Tue, 22 Jan 2008 08:08:59 -0800 Subject: forced fsck (again?)
In-Reply-To: <200801221701.50202.giancarlo.corti@supsi.ch> References: <200801221701.50202.giancarlo.corti@supsi.ch> Message-ID: <20080122160859.GA25057@bitmover.com> > i know i can tune/disable this, but i'd like to understand once > and for all what is the technical rationale behind this practice > and what use is there to force a fsck on a clean fs... Disks rot. -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com From sandeen at redhat.com Tue Jan 22 16:10:38 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 22 Jan 2008 10:10:38 -0600 Subject: forced fsck (again?) In-Reply-To: <200801221701.50202.giancarlo.corti@supsi.ch> References: <200801221701.50202.giancarlo.corti@supsi.ch> Message-ID: <4796157E.5040803@redhat.com> giancarlo corti wrote: > hello everyone. > > i guess this has been asked before, but haven't found it in the faq. > > i have the following issue... > > it is not uncommon nowadays to have desktops with filesystems > in the order of 500gb/1tb. > > now, my kubuntu (but other distros do the same) forces a fsck > on ext3 every so often, no matter what. Did you just update to e2fsprogs-1.40.3? If so, should be fixed in 1.40.4 for the most part. See Debian bug 454926, http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=454926 -Eric From sandeen at redhat.com Tue Jan 22 16:11:54 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Tue, 22 Jan 2008 10:11:54 -0600 Subject: forced fsck (again?) In-Reply-To: <4796157E.5040803@redhat.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> Message-ID: <479615CA.1090408@redhat.com> Eric Sandeen wrote: > giancarlo corti wrote: >> hello everyone. >> >> i guess this has been asked before, but haven't found it in the faq. >> >> i have the following issue... >> >> it is not uncommon nowadays to have desktops with filesystems >> in the order of 500gb/1tb. 
>> >> now, my kubuntu (but other distros do the same) forces a fsck >> on ext3 every so often, no matter what. > > Did you just update to e2fsprogs-1.40.3? > > If so, should be fixed in 1.40.4 for the most part. > > See Debian bug 454926, > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=454926 Oh - or, if it's not running each time, and it's just a general question about periodic fscks, then yeah, what Larry said, I guess. Although not all filesystems do this. -Eric From val.henson at gmail.com Tue Jan 22 22:34:35 2008 From: val.henson at gmail.com (Valerie Henson) Date: Tue, 22 Jan 2008 14:34:35 -0800 Subject: forced fsck (again?) In-Reply-To: <479615CA.1090408@redhat.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> Message-ID: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> On Jan 22, 2008 8:11 AM, Eric Sandeen wrote: > Eric Sandeen wrote: > > giancarlo corti wrote: > >> hello everyone. > >> > >> i guess this has been asked before, but haven't found it in the faq. > >> > >> i have the following issue... > >> > >> it is not uncommon nowadays to have desktops with filesystems > >> in the order of 500gb/1tb. > >> > >> now, my kubuntu (but other distros do the same) forces a fsck > >> on ext3 every so often, no matter what. > > > > Did you just update to e2fsprogs-1.40.3? > > > > If so, should be fixed in 1.40.4 for the most part. > > > > See Debian bug 454926, > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=454926 > > Oh - or, if it's not running each time, and it's just a general question > about periodic fscks, then yeah, what Larry said, I guess. > > Although not all filesystems do this. This will be ironic coming from me, but I think the ext3 defaults for forcing a file system check are a little too conservative for many modern use cases. The two cases I have in mind in particular are: * Servers with long uptimes that need very low data unavailability times. 
Imagine you have a machine room full of servers that have all been up and running happily for more than 180 days - the preferred case. Now imagine that the room overheats and the emergency power cut is tripped. Standard heat reduction is swiftly applied (i.e., open the door and turn on a fan and hope security doesn't notice) and the power turned back on. Now your entire machine room will be fscking for the next 3 hours and whatever service they provide will be completely unavailable. Of course, any admin worth their salt will turn off force fsck so it only runs during controlled downtime... won't they? * Laptops. If suspend and resume doesn't work on your laptop, you'll be rebooting (and remounting) a lot, perhaps several times a day. The preferred solution is to get Matthew Garrett to fix your laptop, but if you can't, fscking every 10-30 days seems a little excessive. Desktop users who shutdown daily to save power will have similar problems. Distros often have the "don't fsck on battery" option and some don't use the ext3 defaults for mkfs, but that's only a partial solution. In this case, it's definitely a little much to ask a random laptop user to tune their file system. I'm not sure what the best solution is - print warnings for several days/mounts before the force fsck? print warnings but don't force fsck? increase the default days/mounts before force fsck? base force fsck intervals on write activity? - but in practice I find myself telling people about "tune2fs -c 0 -i 0" a lot. I use it on all my file systems and run fsck by hand every few months (or more often when I'm working on fsck :) ). Disks do rot, and file systems do get corrupted, and fsck should be run periodically, but the current system of frequent unpredictable forced fsck at boot is probably not the best cost/benefit tradeoff for many use cases. 
-VAL From tytso at MIT.EDU Tue Jan 22 22:52:48 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Tue, 22 Jan 2008 17:52:48 -0500 Subject: forced fsck (again?) In-Reply-To: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> Message-ID: <20080122225248.GD1659@mit.edu> On Tue, Jan 22, 2008 at 02:34:35PM -0800, Valerie Henson wrote: > This will be ironic coming from me, but I think the ext3 defaults for > forcing a file system check are a little too conservative for many > modern use cases. The two cases I have in mind in particular are: Yeah. To the extent that people are using devicemapper/LVM everywhere, there is a much better solution. To wit:

#!/bin/sh
#
# e2croncheck
VG=closure
VOLUME=root
SNAPSIZE=100m
EMAIL=tytso at mit.edu
TMPFILE=`mktemp -t e2fsck.log.XXXXXXXXXX`
set -e
START="$(date +'%Y%m%d%H%M%S')"
lvcreate -s -L ${SNAPSIZE} -n "${VOLUME}-snap" "${VG}/${VOLUME}"
if nice logsave -as $TMPFILE e2fsck -p -C 0 "/dev/${VG}/${VOLUME}-snap" && \
   nice logsave -as $TMPFILE e2fsck -fy -C 0 "/dev/${VG}/${VOLUME}-snap" ; then
  echo 'Background scrubbing succeeded!'
  tune2fs -C 0 -T "${START}" "/dev/${VG}/${VOLUME}"
else
  echo 'Background scrubbing failed! Reboot to fsck soon!'
  tune2fs -C 16000 -T "19000101" "/dev/${VG}/${VOLUME}"
  if test -n "$EMAIL"; then
    mail -s "E2fsck of /dev/${VG}/${VOLUME} failed!" $EMAIL < $TMPFILE
  fi
fi
lvremove -f "${VG}/${VOLUME}-snap"
rm $TMPFILE

> * Servers with long uptimes that need very low data unavailability > times. Imagine you have a machine room full of servers that have all > been up and running happily for more than 180 days - the preferred > case. And the server should be checking the filesystem every month or so. But with the long, extended uptime, it doesn't happen. Using LVM and the above script solves that problem. > * Laptops.
> If suspend and resume doesn't work on your laptop, you'll
> be rebooting (and remounting) a lot, perhaps several times a day. The
> preferred solution is to get Matthew Garrett to fix your laptop, but
> if you can't, fscking every 10-30 days seems a little excessive.

It's sad that it's so hard to get suspend/resume working. But yeah, it's either Matthew or someone like Nigel from the TuxOnIce lists to help you, or maybe a few other people. Checking from cron is, I believe, the right answer here, too, as long as there is a check to make sure you're running on AC before doing the check.

So ---- for someone who has time, I offer the following challenge. Take the above script, and enhance it in the following ways:

* Read a configuration file to see which filesystem(s) to check and to which e-mail the error reports should be sent.

* Have the script abort the check if the system appears to be running off of a battery.

* Have the config file define a time period (say, 30 days), and have the script test to see if the last_mount time is greater than the time interval. If it is, then it does the check, otherwise it skips it.

With these enhancements, in the laptop case the script could be fired off by cron every night at 3am, and if a month has gone by without a check, AND the laptop is running off the AC mains, the check happens automatically, in the background.

> I'm not sure what the best solution is - print warnings for several
> days/mounts before the force fsck? print warnings but don't force
> fsck? increase the default days/mounts before force fsck? base force
> fsck intervals on write activity? - but in practice I find myself
> telling people about "tune2fs -c 0 -i 0" a lot. I use it on all my
> file systems and run fsck by hand every few months (or more often when
> I'm working on fsck :) ).

Well, this isn't a complete solution, because a lot of people don't use LVM, often because they don't trust initrd's to do the right thing --- and quite frankly, I can't blame them.
But doing this kind of thing is so much better that maybe it would actually help convert more kernel developers to use LVM on their boot filesystem. (Well, probably not. That's probably being too optimistic. :-) - Ted From bryan at kadzban.is-a-geek.net Wed Jan 23 01:50:33 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Tue, 22 Jan 2008 20:50:33 -0500 Subject: forced fsck (again?) In-Reply-To: <20080122225248.GD1659@mit.edu> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> Message-ID: <47969D69.4060607@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Theodore Tso wrote: > So ---- for someone who has time, I offer the following challenge. > Take the above script, and enhance it in the following ways: > > * Read a configuration file to see which filesystem(s) to > check and to which e-mail the error reports should be sent. Add support for checking multiple FSes, too. :-) > * Have the script abort the check if the system appears to be > running off of a battery. Sort of. Much of this on_ac_power function was stolen from Debian's powermgmt_base package's on_ac_power script, but it doesn't support anything other than ACPI. (It checks the new sysfs power_supply class first, and the /proc/acpi/ac_adapter/ directory second.) If the function can't determine if AC power is available, the script assumes it's on battery, and exits; this is suboptimal for desktops, but good for laptops that don't have ACPI turned on for whatever reason. > * Have the config file define a time period (say, 30 days), > and have the script test to see if the last_mount time is > greater than the time interval. If it is, then it does the > check, otherwise it skips it. Well, this script looks at the last-check time, not the last-mount time. But close enough. 
> With these enhancements, in the laptop case the script could be fired > off by cron every night at 3am, and if a month has gone by without a > check, AND the laptop is running off the AC mains, the check happens > automatically, in the background. See the attached script (e2check) and sample config file (e2check.conf). :-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHlp1nS5vET1Wea5wRA2XkAKC9vPadZzYxbBITFVkSUAntYGOk4QCg4+SZ QK+2xfdB7wtVF/J152S/P2s= =lhcS -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check.conf URL: From tytso at MIT.EDU Wed Jan 23 03:10:12 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Tue, 22 Jan 2008 22:10:12 -0500 Subject: forced fsck (again?) In-Reply-To: <47969D69.4060607@kadzban.is-a-geek.net> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> Message-ID: <20080123031012.GD1320@mit.edu> On Tue, Jan 22, 2008 at 08:50:33PM -0500, Bryan Kadzban wrote: > See the attached script (e2check) and sample config file (e2check.conf). > :-) Just a few requests. First of all, can you send me a Signed-Off-By, so I can include it in future versions of e2fsprogs. See the SUBMITTING-PATCHES in the top-level of the e2fsprogs source tree. (It means the same thing as the Linux Kernel). > > * Have the script abort the check if the system appears to be > > running off of a battery. > > Sort of. Much of this on_ac_power function was stolen from Debian's > powermgmt_base package's on_ac_power script, but it doesn't support > anything other than ACPI. 
(It checks the new sysfs power_supply class > first, and the /proc/acpi/ac_adapter/ directory second.) > > If the function can't determine if AC power is available, the script > assumes it's on battery, and exits; this is suboptimal for desktops, but > good for laptops that don't have ACPI turned on for whatever reason. Yeah, the default needs to be the other way around for servers, which may not have the ac_adapter interface at all. > > * Have the config file define a time period (say, 30 days), > > and have the script test to see if the last_mount time is > > greater than the time interval. If it is, then it does the > > check, otherwise it skips it. > > Well, this script looks at the last-check time, not the last-mount time. > But close enough. Yeah, that's what I wanted. > See the attached script (e2check) and sample config file (e2check.conf). > :-) Hmm, if you're going to source the config file directly, why not do this instead: check_lvm_fs closure root 100m 30 check_lvm_fs closure home 100m 30 instead of this: > VGS=(closure closure) > VOLUMES=(root home) > SNAPSIZES=(100m 100m) > INTERVALS=(30 30) If you have six or eight volumes to check, keeping them lined up could be error-prone. Thanks for stepping up! - Ted From bryan at kadzban.is-a-geek.net Wed Jan 23 03:35:43 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Tue, 22 Jan 2008 22:35:43 -0500 Subject: forced fsck (again?) In-Reply-To: <20080123031012.GD1320@mit.edu> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> Message-ID: <4796B60F.4040009@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Theodore Tso wrote: > First of all, can you send me a Signed-Off-By, Sure, I'll do that on the version that has other changes below. 
I realized after sending the first version that it probably also needed to be GPLed, as technically, the attachment before wasn't distributable at all. I meant it to be GPL, since it used bits of the powermgmt-base package from Debian. I assume that's the license implied, since I'm submitting it to be included in e2fsprogs? Or does GPLv2 need to be mentioned in the files as well? (Actually, what about GPL versions? It looks like e2fsprogs is still at GPL version 2 -- that's OK with me, but do I need to say "v2 only", "v2 or later", or nothing specific?) > Yeah, the default needs to be the other way around for servers, which > may not have the ac_adapter interface at all. Sounds like another config file setting, then. I can probably simplify the interface a bit if the decision is made by a config file setting, too. (I can make the function return 0 or 1 based on the config file if it falls through all the checks that are there, instead of returning 255 and making the caller handle it differently.) > Hmm, if you're going to source the config file directly, why not do > this instead: > > check_lvm_fs closure root 100m 30 > check_lvm_fs closure home 100m 30 Are you thinking that the check_lvm_fs calls would be in the config file (after setting global options), and the check_lvm_fs function would be defined in the main script? That's my guess here, and it'd probably work OK, but it'd take a bit of work. And it's getting late here, so I probably won't get it changed until at least tomorrow night. > If you have six or eight volumes to check, keeping them lined up > could be error-prone. That's true. I was looking for a way to do named array indices, like a hashtable in most other languages, but it doesn't look like bash has that ability. Plain old Bourne sh almost certainly doesn't. Anyway, calling the check_lvm_fs function from the config file is a little bit backwards, but would certainly work better than a bunch of arrays that all have to be in sync. 
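Something like this is the shape I have in mind -- the function lives in the main script, and the config file just calls it once per volume (a stubbed-out sketch; the real body would do the snapshot-and-fsck work):

```shell
# Main script defines check_lvm_fs, then sources the config file.
# This stub only records each request so the calling convention is visible.
CHECKS=""
check_lvm_fs() {
    vg=$1 lv=$2 snapsize=$3 interval_days=$4
    CHECKS="${CHECKS:+$CHECKS }$vg/$lv:$snapsize:$interval_days"
}

# ". /etc/e2check.conf" would go here; simulate its contents:
check_lvm_fs closure root 100m 30
check_lvm_fs closure home 100m 30

echo "$CHECKS"   # prints: closure/root:100m:30 closure/home:100m:30
```

Keeping the four arguments together per call is exactly what avoids the lined-up-arrays problem.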
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHlrYOS5vET1Wea5wRA/ZyAJ9zpvPzCFUl28KuNDJfv+G19yt2IwCgmB5p D9SIDhoJ3eF7khgXgb0WSXY= =ea4T -----END PGP SIGNATURE----- From darkonc at gmail.com Wed Jan 23 04:10:25 2008 From: darkonc at gmail.com (Stephen Samuel) Date: Tue, 22 Jan 2008 20:10:25 -0800 Subject: forced fsck (again?) In-Reply-To: <4796B60F.4040009@kadzban.is-a-geek.net> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> Message-ID: <6cd50f9f0801222010r374dfe5lcf7bc24b5d2ad82d@mail.gmail.com> On Jan 22, 2008 7:35 PM, Bryan Kadzban wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: RIPEMD160 > > (Actually, what about GPL versions? It looks like e2fsprogs is still at > GPL version 2 -- that's OK with me, but do I need to say "v2 only", "v2 > or later", or nothing specific?) > My suggestion is 'V2 or later', since that covers V2, V3 (and, eventually, Vs 4, 5, 6 and 7). -- Stephen Samuel http://www.bcgreen.com 778-861-7641 -------------- next part -------------- An HTML attachment was scrubbed... URL: From adilger at sun.com Wed Jan 23 08:15:48 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 23 Jan 2008 01:15:48 -0700 Subject: forced fsck (again?) 
In-Reply-To: <4796B60F.4040009@kadzban.is-a-geek.net> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> Message-ID: <20080123081548.GY3180@webber.adilger.int> On Jan 22, 2008 22:35 -0500, Bryan Kadzban wrote: > > Hmm, if you're going to source the config file directly, why not do > > this instead: > > > > check_lvm_fs closure root 100m 30 > > check_lvm_fs closure home 100m 30 > > Are you thinking that the check_lvm_fs calls would be in the config file > (after setting global options), and the check_lvm_fs function would be > defined in the main script? That's my guess here, and it'd probably > work OK, but it'd take a bit of work. And it's getting late here, so I > probably won't get it changed until at least tomorrow night. It probably makes more sense just to parse /etc/fstab and check the filesystems that have PASS != 0 (column 6), since those are the filesystems that will be automatically checked on the next boot. This also avoids more configuration by the user, which is always desirable. The second benefit of parsing /etc/fstab is that the filesystem type can be checked and "fsck.{fstype}" used (if available) instead of just "e2fsck". Alternately, using "lvscan" to check for mounted LVM filesystems and their fstype is another option, since there is no guarantee that all filesystems listed in /etc/fstab are on LVM. 
That's what I did in a very old, but similar, script: http://osdir.com/ml/linux.lvm.devel/2003-04/msg00001.html The only unfortunate thing is that I was revalidating this script still works with LVM2 on my system, and created an LV snapshot (worked OK), but when I tried to lvremove it immediately thereafter the system went into 100% IO wait and the lvremove process was unkillable :-(. This was the 2.6.16 SLES10 kernel, so that may have been fixed in the meantime... The LVM functions used in this script still appear to be working with LVM2, so I think it is still a valid approach. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Wed Jan 23 09:16:01 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 23 Jan 2008 02:16:01 -0700 Subject: forced fsck (again?) In-Reply-To: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> Message-ID: <20080123091601.GZ3180@webber.adilger.int> On Jan 22, 2008 14:34 -0800, Valerie Henson wrote: > I'm not sure what the best solution is - print warnings for several > days/mounts before the force fsck? print warnings but don't force > fsck? increase the default days/mounts before force fsck? I believe current e2fsprogs already prints the number of mounts remaining before e2fsck is forced, though this doesn't help for time-based checks with a long system uptime. Conversely, I think for users that have set "-c 0 -i 0" e2fsck should print a message like "fs mounted 50 times, last e2fsck was 200 days ago" or similar, if the default limits are exceeded to alert the user that this might be an issue. > base force fsck intervals on write activity? 
I had submitted a patch ages ago that considered "clean" unmounts less dangerous than "crash" and only incremented the mount count about 1/5 times in that case (randomly). > Disks do rot, and file systems do get corrupted, and fsck should be > run periodically, but the current system of frequent unpredictable > forced fsck at boot is probably not the best cost/benefit tradeoff for > many use cases. Maybe some of the distro folks (Eric? :-) will pick up on this thread and consider adding the "e2fsck snapshot" script to cron.monthly or similar. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From sandeen at redhat.com Wed Jan 23 14:05:21 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Wed, 23 Jan 2008 08:05:21 -0600 Subject: forced fsck (again?) In-Reply-To: <20080123091601.GZ3180@webber.adilger.int> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080123091601.GZ3180@webber.adilger.int> Message-ID: <479749A1.5040208@redhat.com> Andreas Dilger wrote: > Maybe some of the distro folks (Eric? :-) will pick up on this thread and > consider adding the "e2fsck snapshot" script to cron.monthly or similar. I'm watching.... sure, that might be a candidate for Fedora. Ideally it'd be part of e2fsprogs, so we're not carrying/maintaining stuff that's not upstream. But Fedora does install onto lvm by default, so it sounds like a good candidate for fedora. -Eric > Cheers, Andreas From tytso at MIT.EDU Wed Jan 23 14:08:47 2008 From: tytso at MIT.EDU (Theodore Tso) Date: Wed, 23 Jan 2008 09:08:47 -0500 Subject: forced fsck (again?) 
In-Reply-To: <20080123081548.GY3180@webber.adilger.int>
References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int>
Message-ID: <20080123140847.GB29321@mit.edu>

On Wed, Jan 23, 2008 at 01:15:48AM -0700, Andreas Dilger wrote:
> It probably makes more sense just to parse /etc/fstab and check the
> filesystems that have PASS != 0 (column 6), since those are the
> filesystems that will be automatically checked on the next boot. This
> also avoids more configuration by the user, which is always desirable.

I thought of that, but given that you need to configure the e-mail to send reports, and the snapshot size, we need another configuration file anyway. (We could sneak some of that information into the options field of fstab, since the kernel and other programs that parse that field just take what they need and ignore the rest, but.... ick, ick, ick. :-)

Also, I could imagine that a user might not want to check all of the filesystems in fstab.

> Alternately, using "lvscan" to check for mounted LVM filesystems and
> their fstype is another option, since there is no guarantee that all
> filesystems listed in /etc/fstab are on LVM. That's what I did in a
> very old, but similar, script:
>
> http://osdir.com/ml/linux.lvm.devel/2003-04/msg00001.html

I do like the fact that your script does much better error checking than mine. :-)

- Ted

From adilger at sun.com Wed Jan 23 19:23:34 2008
From: adilger at sun.com (Andreas Dilger)
Date: Wed, 23 Jan 2008 12:23:34 -0700
Subject: forced fsck (again?)
In-Reply-To: <20080123140847.GB29321@mit.edu> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> Message-ID: <20080123192334.GG3180@webber.adilger.int> On Jan 23, 2008 09:08 -0500, Theodore Tso wrote: > I thought of that, but given that you need to configure the e-mail to > send reports, and the snapshot size, we need another configuration > file anyway. (We could sneek some of that information into the > options field of fstab, since the kernel and other programs that parse > that field just take what they need and ignore the rest, but.... ick, > ick, ick. :-) I agree - adding email to fstab is icky and I wouldn't go there. I don't see a problem with just emailing it to "root@" by default and giving the user the option to change it to something else. > Also, I could imagine that a user might not want to check all of the > filesystems in fstab. Similarly, a config file which disables checking on some LV if specified seems reasonable. IMHO the main goal is to make things transparent to the user and avoid their annoyance of "e2fsck at boot". Since the e2fsck is on a read-only LV snapshot, there shouldn't be any danger to the filesystems. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From bryan at kadzban.is-a-geek.net Thu Jan 24 02:10:31 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Wed, 23 Jan 2008 21:10:31 -0500 Subject: forced fsck (again?) 
In-Reply-To: <20080123192334.GG3180@webber.adilger.int> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> Message-ID: <4797F397.9020306@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 23, 2008 09:08 -0500, Theodore Tso wrote: >> (We could sneek some of that information into the options field of >> fstab, since the kernel and other programs that parse that field >> just take what they need and ignore the rest, but.... ick, ick, >> ick. :-) > > I agree - adding email to fstab is icky and I wouldn't go there. I > don't see a problem with just emailing it to "root@" by default and > giving the user the option to change it to something else. Since the email address is not per-filesystem, it's fine by me to put it into a config file somewhere. Forcing the interval to be global is probably also OK, although I wouldn't want to be forced to set the snapshot size globally. I do think that fstab is the best place for per-filesystem options, though. But it's not too difficult to parse out a custom SNAPSIZE option, and even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is present on any LV, if the script is going to parse fstab anyway. (Or should the option's name be lowercase? Either will work.) >> Also, I could imagine that a user might not want to check all of >> the filesystems in fstab. > > Similarly, a config file which disables checking on some LV if > specified seems reasonable. That does seem reasonable, but I haven't done it in the script that's attached. 
Maybe support for a SKIP (or skip, or e2check_skip, or skip_e2check, or whatever) option in fstab's options field? Regarding the idea of having this support multiple filesystems -- that's a good idea, I think, but the current script is highly specific to ext2 or ext3. Use of tune2fs (to reset the last-check time) and dumpe2fs (to find the last-check time), in particular, will be problematic on other FSes. I haven't done that in this script, though it may be possible. Anyway, here's a second version. I've changed it to parse up fstab, and added an option for what to do if AC status can't be determined. Kernel-style changelog entry, etc., below: - ------- Create a script to transparently run e2fsck in the background on any LVM logical volumes listed in /etc/fstab, as long as the machine is on AC power, and that LV has been last checked more than a configurable number of days ago. Also create a configuration file to set various options in the script. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHl/OXS5vET1Wea5wRA/UaAJwIE27W6qasI7Gm/uvZm/pY1rcBtwCcDXYq cc3qE/uOEqm4ksYHlI6+IJU= =7Lf3 -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: e2check.conf URL: From adilger at sun.com Thu Jan 24 04:39:30 2008 From: adilger at sun.com (Andreas Dilger) Date: Wed, 23 Jan 2008 21:39:30 -0700 Subject: forced fsck (again?) 
In-Reply-To: <4797F397.9020306@kadzban.is-a-geek.net> References: <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> Message-ID: <20080124043930.GG18433@webber.adilger.int> On Jan 23, 2008 21:10 -0500, Bryan Kadzban wrote: > Since the email address is not per-filesystem, it's fine by me to put it > into a config file somewhere. Forcing the interval to be global is > probably also OK, although I wouldn't want to be forced to set the > snapshot size globally. I do think that fstab is the best place for > per-filesystem options, though. > > But it's not too difficult to parse out a custom SNAPSIZE option, and > even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is > present on any LV, if the script is going to parse fstab anyway. (Or > should the option's name be lowercase? Either will work.) The problem with this is that ext2/3/4, along with most other filesystems will fail to mount if passed an unknown mount option. > Regarding the idea of having this support multiple filesystems -- that's > a good idea, I think, but the current script is highly specific to ext2 > or ext3. Use of tune2fs (to reset the last-check time) and dumpe2fs (to > find the last-check time), in particular, will be problematic on other > FSes. I haven't done that in this script, though it may be possible. Well, my equivalent script just checks for fsck.${fstype} and runs that on the snapshot, if available. Even if tune2fs isn't there to update a "last checked" field, it is still a useful indication of the health of the filesystem for a long-running system. 
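In shell, that dispatch is little more than the following (a sketch: the run_fsck name and the choice of the -p preen flag are mine, not taken from the script above):

```shell
# Run fsck.<fstype> on a device if such a helper exists on PATH,
# otherwise skip with a note on stderr. Illustrative only; e.g.
#   run_fsck "$FSTYPE" "/dev/${VG}/${LV}-snap"
run_fsck() {
    fstype=$1 dev=$2
    helper="fsck.$fstype"
    if command -v "$helper" >/dev/null 2>&1; then
        "$helper" -p "$dev"
    else
        echo "no $helper found, skipping $dev" >&2
        return 1
    fi
}
```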
For filesystems like XFS, where fsck.xfs is (unfortunately) an empty shell that does nothing, this could be special-cased to call xfs_check.

> # parse up fstab
> grep -v '^#' /etc/fstab | grep -v '^$' | awk '$6!=0 {print $1,$3,$4;}' | \
> while read FS FSTYPE OPTIONS ; do

Urk, that is kind of ugly shell scripting... Cleaner would be:

cat /etc/fstab | while read FS DEV FSTYPE OPTIONS DUMP PASS; do
    case $FS in
    "") continue ;;
    \#*) continue ;;
    esac
    ...
done

But I've come to think that /etc/fstab is the wrong thing to use for input. This script is only useful for LVM volumes, so getting a list of LVs is more appropriate, I think.

> # get the volume group (or an error message)
> VG="`lvs --noheadings -o vg_name "$FS" 2>&1`"

Interesting, I wasn't aware of lvs... It looks like "lvdisplay -C".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From menscher at uiuc.edu Thu Jan 24 08:27:23 2008
From: menscher at uiuc.edu (Damian Menscher)
Date: Thu, 24 Jan 2008 00:27:23 -0800
Subject: forced fsck (again?)
In-Reply-To: <4797F397.9020306@kadzban.is-a-geek.net>
References: <200801221701.50202.giancarlo.corti@supsi.ch> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net>
Message-ID: <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com>

2008/1/23 Bryan Kadzban :
> But it's not too difficult to parse out a custom SNAPSIZE option, and
> even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is
> present on any LV, if the script is going to parse fstab anyway. (Or
> should the option's name be lowercase? Either will work.)

At the risk of adding complexity, what about having the SNAPSIZE be automatically determined?
Most users would have no idea what to set it to, and we should be able to guess some reasonable values. For example, the fsck time can probably be estimated by looking at the number of inodes, how full the filesystem is, etc. Alternatively, we could just allocate all available space in the LVM. I also have a newbie question: does the fsck of a snapshot really catch everything that might be wrong with the drive, or are there other failure modes that only a real fsck would catch? I'm wondering if it's still a good idea to do an occasional full fsck. Damian -- http://www.uiuc.edu/~menscher/ From bryan at kadzban.is-a-geek.net Thu Jan 24 12:19:17 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Thu, 24 Jan 2008 07:19:17 -0500 Subject: forced fsck (again?) In-Reply-To: <20080124043930.GG18433@webber.adilger.int> References: <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> Message-ID: <47988245.4010904@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 23, 2008 21:10 -0500, Bryan Kadzban wrote: >> But it's not too difficult to parse out a custom SNAPSIZE option, >> and even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE >> option is present on any LV, if the script is going to parse fstab >> anyway. (Or should the option's name be lowercase? Either will >> work.) > > The problem with this is that ext2/3/4, along with most other > filesystems will fail to mount if passed an unknown mount option. Uh oh. Yeah, that's a problem. 
I was under the impression that all the tools would ignore unknown options -- if that's not the case, then we probably need to come up with something else. Automatically determining the snapshot size sounds like a good idea, but I'm not sure how to do it. (I'm not sure what decides the snapshot size that you need -- it looks like the number of changes that you're going to make to the snapshot, or maybe the number of changes that you're going to make to both the snapshot and the real LV? In either case, I'm not sure how to find that out. Maybe just using "all available space in the VG" is a better idea anyway.) >> Regarding the idea of having this support multiple filesystems -- >> that's a good idea, I think, but the current script is highly >> specific to ext2 or ext3. Use of tune2fs (to reset the last-check >> time) and dumpe2fs (to find the last-check time), in particular, >> will be problematic on other FSes. I haven't done that in this >> script, though it may be possible. > > Well, my equivalent script just checks for fsck.${fstype} and runs > that on the snapshot, if available. Even if tune2fs isn't there to > update a "last checked" field, it is still a useful indication of the > health of the filesystem for a long-running system. True, but what about determining whether it has to run at all (based on the last-check time)? Although, I suppose it would work to leave the check interval set in the superblock, and avoid using fsck.* -f; that way each fsck would be able to determine if it should do a full check or not. Of course that means that if you can't update the last-checked time, then it'll run a check every day after the interval passes (and the machine is on AC). Of course the current script will do that too, so at least it isn't any worse there. >> grep -v '^#' /etc/fstab | grep -v '^$' | awk '$6!=0 {print $1,$3,$4;}' | \ >> while read FS FSTYPE OPTIONS ; do > > Urk, that is kind of ugly shell scripting... Yeah, no kidding. 
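For comparison, here's the pipeline-free shape I think Andreas is suggesting, fleshed out just enough to run (it reads a here-doc sample here rather than the real /etc/fstab, and the function name is made up):

```shell
# Skip blank lines, comment lines, and anything with fs_passno == 0,
# using only the shell's read builtin -- no grep/awk.
list_checked_filesystems() {
    while read -r fs mnt fstype opts dump pass; do
        case $fs in ''|\#*) continue ;; esac
        [ "${pass:-0}" -ne 0 ] || continue
        echo "$fs $fstype"
    done
}

list_checked_filesystems <<'EOF'
# /etc/fstab sample
/dev/closure/root  /      ext3  defaults  1  1
/dev/closure/home  /home  ext3  defaults  1  2
proc               /proc  proc  defaults  0  0
EOF
# prints:
# /dev/closure/root ext3
# /dev/closure/home ext3
```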
I wanted to kill lines with fs_passno set to zero, since I was already killing lines that were empty or comments. I was also afraid that sh would die if I gave "read" more variables than arguments (which is why I wanted to filter out the comments), but doing some testing shows that bash (at least) handles it OK. So maybe a normal read would work better. Or maybe rewriting in C would work; then I could just use getmntent. Although I'm not exactly a fan of writing something like this in C, either; shell is more powerful, except for this "reading fstab" thing. > But I've come to think that /etc/fstab is the wrong thing to use for > input. This script is only useful for LVM volumes, so getting a list > of LVs is more appropriate I think. True, except the no-LVs behavior of lvscan, lvs, and any of the other tools that I was looking at yesterday is decidedly non-optimal. It would probably be possible; I'll see what I can find out later today. I have a QEMU VM set up whose root FS is on LVM, on MD-raid, on DM-raid (I was testing an initramfs setup's worst-case), so it has the LVM tools and filesystems. I'll see what's available there. We'd still need to find the FS type, although I believe udev provides some programs that may be helpful (if we want to rely on them being installed). volume_id, in particular, should provide that info. >> # get the volume group (or an error message) >> VG="`lvs --noheadings -o vg_name "$FS" 2>&1`" > > Interesting, I wasn't aware of lvs... It looks like "lvdisplay -C". Sort of, although I'm not sure what -C does (it's not in my lvdisplay manpage). That manpage refers to lvs (saying "lvs provides considerably more control over the output"), and that was what I was looking for. It's fairly easy to get it to print just the VG or just the LV, which is what I needed. 
From bryan at kadzban.is-a-geek.net Thu Jan 24 12:20:31 2008
From: bryan at kadzban.is-a-geek.net (Bryan Kadzban)
Date: Thu, 24 Jan 2008 07:20:31 -0500
Subject: forced fsck (again?)
In-Reply-To: <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com>
References: <200801221701.50202.giancarlo.corti@supsi.ch> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com>
Message-ID: <4798828F.3030303@kadzban.is-a-geek.net>

Damian Menscher wrote:
> I also have a newbie question: does the fsck of a snapshot really
> catch everything that might be wrong with the drive, or are there
> other failure modes that only a real fsck would catch?

AFAIK, it catches everything. The LVM2 snapshot is effectively a copy
of the FS at the time the snapshot was taken. Of course, that could be
wrong, but I don't believe so...

From adilger at sun.com Thu Jan 24 15:19:23 2008
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 24 Jan 2008 08:19:23 -0700
Subject: forced fsck (again?)
In-Reply-To: <4798828F.3030303@kadzban.is-a-geek.net> References: <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <1d8411e00801240027p5be6d840h43a04f4ee60398e0@mail.gmail.com> <4798828F.3030303@kadzban.is-a-geek.net> Message-ID: <20080124151923.GI18433@webber.adilger.int> On Jan 24, 2008 07:20 -0500, Bryan Kadzban wrote: > Damian Menscher wrote: > > At the risk of adding complexity, what about having the SNAPSIZE be > > automatically determined? Most users would have no idea what to set > > it to, and we should be able to guess some reasonable values. For > > example, the fsck time can probably be estimated by looking at the > > number of inodes, how full the filesystem is, etc. Alternatively, we > > could just allocate all available space in the LVM. Yes, this is what my script does, basically guess at a size (1/500th of the LV size, limited by the amount of free space in the VG). It should be possible to override this in a .conf file, but it should be possible for the majority of systems to run with the defaults. > > I also have a newbie question: does the fsck of a snapshot really > > catch everything that might be wrong with the drive, or are there > > other failure modes that only a real fsck would catch? > > AFAIK, it catches everything. The LVM2 snapshot is effectively a copy > of the FS at the time the snapshot was taken. Yes, it should catch everything. The snapshot process forces the filesystem to flush everything to disk in a consistent manner, as if it were unmounted cleanly and a full copy of the device was made. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. 
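The sizing rule Andreas describes could look like the sketch below, in integer megabytes. The 1/500 ratio and the free-space cap come from this thread; the 32 MB floor is an invented placeholder for the "minimum size for things like the journal":

```shell
# Guess a snapshot size: 1/500th of the LV, clamped between a fixed
# floor and the free space remaining in the VG.
snapshot_size_mb() {
    lv_size_mb=$1
    vg_free_mb=$2
    size=$(( lv_size_mb / 500 ))
    # floor: placeholder for a journal-sized minimum (assumed value)
    [ "$size" -lt 32 ] && size=32
    # ceiling: never ask for more than the VG has free
    [ "$size" -gt "$vg_free_mb" ] && size=$vg_free_mb
    echo "$size"
}
```

The inputs would come from "lvs -o lv_size" and "vgs -o vg_free" with --units m --nosuffix, as discussed later in the thread; any of the constants here should be overridable from the .conf file.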
From bryan at kadzban.is-a-geek.net Fri Jan 25 03:20:04 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Thu, 24 Jan 2008 22:20:04 -0500 Subject: forced fsck (again?) In-Reply-To: <47988245.4010904@kadzban.is-a-geek.net> References: <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> Message-ID: <47995564.2050402@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Bryan Kadzban wrote: > Maybe just using "all available space in the VG" is a better idea > anyway. That's what I did here, at least for now. There's a place in here where the available space in the VG can be checked, but I'm not sure how to get that value out of lvs (or vgs) in a format that's easy to parse, so I skipped it for now as well. (I could only get values like "250m", which I assume means 250 megs, but how is the script supposed to handle the suffixes?) > I suppose it would work to leave the check interval set in the > superblock, and avoid using fsck.* -f; that way each fsck would be > able to determine if it should do a full check or not. Turns out that will *not* work. fsck.* without -f will succeed even if it doesn't check anything (or at least, e2fsck will). So every day, the last-check day will get bumped, even though nothing actually got checked. That defeats the purpose here. 
I've split out the operations of checking the FS, setting the last-check time to now, setting the last-check time to some time in the ancient past (if the check fails -- this forces the next-reboot check to be a full one), and getting the last-check time, into their own functions. Each one takes a device name and filesystem type argument, and splits execution paths depending on the FS type. Adding support for a new FS (e.g. better support for reiser) should be as easy as modifying the case statements in four functions. > It would probably be possible; I'll see what I can find out later > today. I have a QEMU VM set up whose root FS is on LVM... Well, it was set up. I seem to have somehow nuked the md-raid layer, so the LVM stuff isn't available anymore. (It involved a qemu bug (the VM was running, and suddenly died); then when starting it back up, the md-raid code started a "background rebuild", and ended up locking up qemu. I'll probably have to start over with a new set of image files.) > We'd still need to find the FS type, although I believe udev provides > some programs that may be helpful (if we want to rely on them being > installed). volume_id, in particular, should provide that info. I'm running /lib/udev/vol_id here to get the FS type. I'm not sure if that's the best solution or not, but it does work (at least for now). Anyway, I've also renamed the script from e2check to lvcheck (since it works for more than ext* now). Same changelog entry as before, though. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHmVVjS5vET1Wea5wRA6sLAJ472TUX1amJroWIxdGbqQqlLZrS2QCeLHAA z/fhwCISV3krc/coAmfWlVw= =5gFW -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: lvcheck.conf URL: From jack at suse.cz Fri Jan 25 16:09:31 2008 From: jack at suse.cz (Jan Kara) Date: Fri, 25 Jan 2008 17:09:31 +0100 Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66) In-Reply-To: <20080114131454.37eb7c12@think.oraclecorp.com> References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu> <477BF72B.4000608@oracle.com> <20080114170609.GH4214@duck.suse.cz> <20080114131454.37eb7c12@think.oraclecorp.com> Message-ID: <20080125160931.GC1767@duck.suse.cz> On Mon 14-01-08 13:14:54, Chris Mason wrote: > On Mon, 14 Jan 2008 18:06:09 +0100 > Jan Kara wrote: > > On Wed 02-01-08 12:42:19, Zach Brown wrote: > > > Erez Zadok wrote: > > > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's > > > > latest tree. Kernel w/ SMP, preemption, and lockdep configured. > > > > > > This is a real lock ordering problem. Thanks for reporting it. > > > > > > The updating of atime inside sys_mmap() orders the mmap_sem in the > > > vfs outside of the journal handle in ext3's inode dirtying: > > > > > [ lock inversion traces ] > > > > Two fixes come to mind: > > > > > > 1) use something like Peter's ->mmap_prepare() to update atime > > > before acquiring the mmap_sem. > > > ( http://lkml.org/lkml/2007/11/11/97 ). I don't know if this would > > > leave more paths which do a journal_start() while holding the > > > mmap_sem. > > > > > > 2) rework ext3's dio to only hold the jbd handle in > > > ext3_get_block(). Chris has a patch for this kicking around > > > somewhere but I'm told it has problems exposing old blocks in > > > ordered data mode. > > > > > > Does anyone have preferences? I could go either way. I certainly > > > don't like the idea of journal handles being held across the > > > entirety of fs/direct-io.c. It's yet another case of O_DIRECT > > > differing wildly from the buffered path :(. 
> > I've looked more into it and I think that 2) is the only way to go
> > since transaction start ranks below page lock (standard buffered
> > write path) and page lock ranks below mmap_sem. So we have at least
> > one more dependency mmap_sem must go before transaction start...
>
> Just to clarify a little bit:
>
> If ext3's DIO code only touches transactions in get_block, then it can
> violate data=ordered rules. Basically the transaction that allocates
> the blocks might commit before the DIO code gets around to writing them.
>
> A crash in the wrong place will expose stale data on disk.

Hmm, I've looked at it and I don't think so - look at the rationale in
the patch below... That patch should fix the lock-inversion problem (at
least I see no lockdep warnings on my test machine).

								Honza
--
Jan Kara
SUSE Labs, CR

---

We cannot start transaction in ext3_direct_IO() and just let it last
during the whole write because dio_get_page() acquires mmap_sem which
ranks above transaction start (e.g. because we have dependency chain
mmap_sem->PageLock->journal_start, or because we update atime while
holding mmap_sem) and thus deadlocks could happen. We solve the problem
by starting a transaction separately for each ext3_get_block() call.
We *could* have a problem that we allocate a block and before its data
are written out the machine crashes and thus we expose stale data. But
that does not happen because for hole-filling generic code falls back
to buffered writes and for file extension, we add inode to orphan list
and thus in case of crash, journal replay will truncate inode back to
the original size.

Signed-off-by: Jan Kara

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 9b162cd..5ab7c57 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -941,55 +941,45 @@ out:
 	return err;
 }
 
-#define DIO_CREDITS (EXT3_RESERVE_TRANS_BLOCKS + 32)
+/* Maximum number of blocks we map for direct IO at once. */
+#define DIO_MAX_BLOCKS 4096
+/*
+ * Number of credits we need for writing DIO_MAX_BLOCKS:
+ * We need sb + group descriptor + bitmap + inode -> 4
+ * For B blocks with A block pointers per block we need:
+ * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
+ * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
+ */
+#define DIO_CREDITS 25
 
 static int ext3_get_block(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create)
 {
 	handle_t *handle = ext3_journal_current_handle();
-	int ret = 0;
+	int ret = 0, started = 0;
 	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 
-	if (!create)
-		goto get_block;		/* A read */
-
-	if (max_blocks == 1)
-		goto get_block;		/* A single block get */
-
-	if (handle->h_transaction->t_state == T_LOCKED) {
-		/*
-		 * Huge direct-io writes can hold off commits for long
-		 * periods of time.  Let this commit run.
-		 */
-		ext3_journal_stop(handle);
-		handle = ext3_journal_start(inode, DIO_CREDITS);
-		if (IS_ERR(handle))
+	if (create && !handle) {	/* Direct IO write... */
+		if (max_blocks > DIO_MAX_BLOCKS)
+			max_blocks = DIO_MAX_BLOCKS;
+		handle = ext3_journal_start(inode, DIO_CREDITS +
+				2 * EXT3_QUOTA_TRANS_BLOCKS(sb));
+		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
-		goto get_block;
-	}
-
-	if (handle->h_buffer_credits <= EXT3_RESERVE_TRANS_BLOCKS) {
-		/*
-		 * Getting low on buffer credits...
-		 */
-		ret = ext3_journal_extend(handle, DIO_CREDITS);
-		if (ret > 0) {
-			/*
-			 * Couldn't extend the transaction.  Start a new one.
-			 */
-			ret = ext3_journal_restart(handle, DIO_CREDITS);
+			goto out;
 		}
+		started = 1;
 	}
 
-get_block:
-	if (ret == 0) {
-		ret = ext3_get_blocks_handle(handle, inode, iblock,
+	ret = ext3_get_blocks_handle(handle, inode, iblock,
 					max_blocks, bh_result, create, 0);
-		if (ret > 0) {
-			bh_result->b_size = (ret << inode->i_blkbits);
-			ret = 0;
-		}
+	if (ret > 0) {
+		bh_result->b_size = (ret << inode->i_blkbits);
+		ret = 0;
 	}
+	if (started)
+		ext3_journal_stop(handle);
+out:
 	return ret;
 }
@@ -1680,7 +1670,8 @@ static int ext3_releasepage(struct page *page, gfp_t wait)
  * if the machine crashes during the write.
  *
  * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file.
+ * crashes then stale disk data _may_ be exposed inside the file. But current
+ * VFS code falls back into buffered path in that case so we are safe.
  */
 static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 			const struct iovec *iov, loff_t offset,
@@ -1689,7 +1680,7 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	struct ext3_inode_info *ei = EXT3_I(inode);
-	handle_t *handle = NULL;
+	handle_t *handle;
 	ssize_t ret;
 	int orphan = 0;
 	size_t count = iov_length(iov, nr_segs);
@@ -1697,17 +1688,21 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 	if (rw == WRITE) {
 		loff_t final_size = offset + count;
 
-		handle = ext3_journal_start(inode, DIO_CREDITS);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
 		if (final_size > inode->i_size) {
+			/* Credits for sb + inode write */
+			handle = ext3_journal_start(inode, 2);
+			if (IS_ERR(handle)) {
+				ret = PTR_ERR(handle);
+				goto out;
+			}
 			ret = ext3_orphan_add(handle, inode);
-			if (ret)
-				goto out_stop;
+			if (ret) {
+				ext3_journal_stop(handle);
+				goto out;
+			}
 			orphan = 1;
 			ei->i_disksize = inode->i_size;
+			ext3_journal_stop(handle);
 		}
 	}
@@ -1715,18 +1710,21 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
 					offset, nr_segs,
 					ext3_get_block, NULL);
 
-	/*
-	 * Reacquire the handle: ext3_get_block() can restart the transaction
-	 */
-	handle = ext3_journal_current_handle();
-
-out_stop:
-	if (handle) {
+	if (orphan) {
 		int err;
 
-		if (orphan && inode->i_nlink)
+		/* Credits for sb + inode write */
+		handle = ext3_journal_start(inode, 2);
+		if (IS_ERR(handle)) {
+			/* This is really bad luck. We've written the data
+			 * but cannot extend i_size. Bail out and pretend
+			 * the write failed... */
+			ret = PTR_ERR(handle);
+			goto out;
+		}
+		if (inode->i_nlink)
 			ext3_orphan_del(handle, inode);
-		if (orphan && ret > 0) {
+		if (ret > 0) {
 			loff_t end = offset + ret;
 			if (end > inode->i_size) {
 				ei->i_disksize = end;

From adilger at sun.com Fri Jan 25 00:36:05 2008
From: adilger at sun.com (Andreas Dilger)
Date: Thu, 24 Jan 2008 17:36:05 -0700
Subject: forced fsck (again?)
In-Reply-To: <47988245.4010904@kadzban.is-a-geek.net>
References: <20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net>
Message-ID: <20080125003605.GP18433@webber.adilger.int>

On Jan 24, 2008 07:19 -0500, Bryan Kadzban wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Andreas Dilger wrote:
> > The problem with this is that ext2/3/4, along with most other
> > filesystems will fail to mount if passed an unknown mount option.
>
> Uh oh. Yeah, that's a problem.
>
> I was under the impression that all the tools would ignore unknown
> options -- if that's not the case, then we probably need to come up with
> something else. Automatically determining the snapshot size sounds like
> a good idea, but I'm not sure how to do it.
(I'm not sure what decides > the snapshot size that you need -- it looks like the number of changes > that you're going to make to the snapshot, or maybe the number of > changes that you're going to make to both the snapshot and the real LV? Since we aren't making any changes to the LV it is only the changes that are made to the original volume that consume space in the volume. > In either case, I'm not sure how to find that out. Maybe just using > "all available space in the VG" is a better idea anyway.) I made a wild guess of 1/500 of the total volume size. Making the snapshot size a linear function of the volume size makes sense, because the fsck time is generally linear with the volume size, and the amount of change in the original volume (and hence the space needed in the snapshot) is also a linear function of how long the fsck runs. Having a minimum size for things like the journal, and a maximum size of the free space in the VG definitely makes sense. Another thing worth checking in the script is if there is an existing snapshot volume (maybe left over if the script was interrupted by a crash) and delete it before recreating the volume. It also makes sense to have a very clear name like "{lvname}.fsck.temporary.20080124" that can be easily seen by the user as not very useful, and can also be deleted by the script safely. > True, but what about determining whether it has to run at all (based on > the last-check time)? Although, I suppose it would work to leave the > check interval set in the superblock, and avoid using fsck.* -f; that > way each fsck would be able to determine if it should do a full check or > not. I would just run the script from cron.weekly instead of every night. If we miss the check for a few days this isn't harmful, and better than annoying users. > Or maybe rewriting in C would work; then I could just use getmntent. 
> Although I'm not exactly a fan of writing something like this in C, > either; shell is more powerful, except for this "reading fstab" thing. No, I'd rather have a shell script... Less long-term maintenance. > > But I've come to think that /etc/fstab is the wrong thing to use for > > input. This script is only useful for LVM volumes, so getting a list > > of LVs is more appropriate I think. > > True, except the no-LVs behavior of lvscan, lvs, and any of the other > tools that I was looking at yesterday is decidedly non-optimal. What is the problem there? My simple test showed "lvs" on a system w/o LVM reports "No volume groups found" to stderr, and that can easily be ignored. > We'd still need to find the FS type, although I believe udev provides > some programs that may be helpful (if we want to rely on them being > installed). volume_id, in particular, should provide that info. If it's part of e2fsprogs, then using "blkid" is much better, since it is also part of e2fsprogs. export `blkid -s TYPE $FS | cut -d' ' -f2` will set an environment variable TYPE={fstype}. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Fri Jan 25 08:55:57 2008 From: adilger at sun.com (Andreas Dilger) Date: Fri, 25 Jan 2008 01:55:57 -0700 Subject: forced fsck (again?) In-Reply-To: <47995564.2050402@kadzban.is-a-geek.net> References: <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> Message-ID: <20080125085557.GV18433@webber.adilger.int> On Jan 24, 2008 22:20 -0500, Bryan Kadzban wrote: > # Run this from cron each night. 
If the machine is on AC power, it > # will run the checks; otherwise they will all be skipped. (If the > # script can't tell whether the machine is on AC power, a setting in > # the configuration file (/etc/lvcheck.conf) decides whether it will > # continue with the checks, or abort.) Probably once a week is enough, and "/etc/cron.weekly" (anacron) exists on most systems and will ensure that if the system was off for more than a week it will still be run on the next boot. > # Any LV that passes fsck will have its last-check time updated (in > # the real superblock, not the snapshot's superblock); any LV whose > # fsck fails will send an email notification to a configurable user > # ($EMAIL). This $EMAIL setting is optional, but its use is highly > # recommended, since if any LV fails, it will need to be checked > # manually, offline. I would recommend also using "logger" to log something in /var/log/messages. > # attempt to force a check of $1 on the next reboot > function try_force_check() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 16000 -T "19000101" "$dev" > ;; > reiserfs) > # ??? > echo "Don't know how to set the last-check time on reiserfs..." >&2 > ;; > *) > echo "Don't know how to set the last-check time on $fstype..." >&2 > ;; > esac > } These error messages are incorrect, namely "set the last-check time" should be replaced with "force a check". Since there isn't any reason to special case reiserfs here, you may as well remove it. I suspect that a nice email to the XFS and JFS folks would get them to add some mechanism to force a filesystem check on the next reboot. > # check the FS on $1 passively, printing output to $3. > function perform_check() { > case "$fstype" in > ext2|ext3) > # the only point in fixing anything is just to see if fsck can. 
> nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" && > nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev" Hmm, I'm not sure I understand what it is you want to do? The fsck should be run as 'e2fsck -fn "$dev"' (since we already know this is ext2|ext3). Using "-C 0" isn't useful because we don't want progress in the output log, and "-p" without "-f" will just check the superblock. We don't want to be fixing anything (since this should be a read-only snapshot) so "-fy" is also not so great. > # do everything needed to check and reset dates and counters on /dev/$1/$2. > function check_fs() { > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > trap "rm $tmpfile ; trap - RETURN" RETURN For the log file it probably makes sense to keep this around with a timestamp if there is a failure. That means it is fine to generate a random filename temporarily, but it should be renamed to something meaningful (e.g. /var/log/lvfsck.$dev.$(date +%Y%m%d) or similar). > # only one check happens at a time; using all the free space in the VG > # at least won't prevent other checks from happening... > lvcreate -s -l "100%FREE" -n "${lv}-snap" "${vg}/${lv}" To find free space, use "vgs -o vg_size --noheadings ${vg}", and the LV size can be had from "lvs -o lv_size --noheadings ${vg}/${lv}". You can strip the size suffixes with "--units M --nosuffix" to get units of MB. Also good to create a more unique name than "${lv}-snap", since that might conflict with an existing snapshot, and if the script crashes the user might be wondering if that LV using 100% of the free space is safe to delete or not. Please also add XFS support here, having it call "xfs_check", since fsck.xfs is an empty shell... For JFS it can also use "fsck.jfs -fn $dev" to check the filesystem. > if perform_check "/dev/${vg}/${lv}-snap" "${fstype}" "${tmpfile}" ; then > echo 'Background scrubbing succeeded!' > try_delay_checks "/dev/${vg}/${lv}" "$fstype" > else > echo 'Background scrubbing failed! 
Reboot to fsck soon!' Printing the device name in these messages, and sending them to the syslog via logger would probably be more useful. > try_force_check "/dev/${vg}/${lv}" "$fstype" > > if test -n "$EMAIL"; then > mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < $tmpfile > fi > > set -e Have you verified that the script doesn't exit if an fsck fails with an error? > # pull in configuration -- don't bother with a parser, just use the shell's > . /etc/lvcheck.conf You should check that this file exists before sourcing it, or the script will exit with an error: [ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf > # parse up lvscan output > lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \ > while read DEV ; do > # remove the single quotes around the device name > DEV="`echo "$DEV" | tr -d \'`" > > # get the FS type > FSTYPE="`/lib/udev/vol_id -t "$DEV"`" Please use "blkid", since that is part of e2fsprogs already and avoids an extra dependency. > # if the date is unknown, run fsck every day. sigh. Better to write "run fsck each time the script is run". > # get the free space > SPACE="`lvs --noheadings -o vg_free "$DEV"`" > > # ensure that some free space exists, at least > # ??? -- can lvs print vg_free in plain numbers, or do I have to > # figure out what a suffix of "m" means? skip the check for now. "vgs", and --nosuffix, per above. > #!/bin/sh > > # e2check configuration variables: > # > # EMAIL > # Address to send failure notifications to. If empty, > # failure notifications will not be sent. > # > # INTERVAL > # Days to wait between checks. All LVs use the same > # INTERVAL, but the "days since last check" value can > # be different per LV, since that value is stored in > # the ext2/ext3 superblock. > # > # AC_UNKNOWN > # Whether to run the e2fsck checks if the script can't > # determine whether the machine is on AC power. Laptop > # users will want to set this to ABORT, while server and > # desktop users will probably want to set this to > # CONTINUE. 
Those are the only two valid values. > > EMAIL='root' > INTERVAL=30 > AC_UNKNOWN="ABORT" I would also make these all be defaults in the script (before this file is parsed), so it works as expected if /etc/lvscan.conf doesn't exist. I'd also recommend that the default for AC_UNKNOWN be CONTINUE (or possibly leave it unset by default and have the script not error out in this case, so that the script does something useful for the majority of users. If we are worried about the laptop case, we could add checks to see if the system has a PC card, since very few desktop systems have them. Both the commands "pccardctl info" and "cardctl info" produce no output on stdout if there is no PC card slot, and this could be used to decide between "CONTINUE" for desktops and "ABORT" for laptops. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From bryan at kadzban.is-a-geek.net Sat Jan 26 02:02:56 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Fri, 25 Jan 2008 21:02:56 -0500 Subject: forced fsck (again?) In-Reply-To: <20080125085557.GV18433@webber.adilger.int> References: <47969D69.4060607@kadzban.is-a-geek.net> <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> Message-ID: <479A94D0.9080308@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 24, 2008 22:20 -0500, Bryan Kadzban wrote: >> # Run this from cron each night. > > Probably once a week is enough, and "/etc/cron.weekly" (anacron) exists > on most systems and will ensure that if the system was off for more than > a week it will still be run on the next boot. 
Yeah, it's probably true that once per week is enough. Do you think it would still make sense to try and parse out the last-check time from the LV if this gets run each week, or just unconditionally check everything (if on AC)? Checking everything weekly might be too often (especially if the extra disk usage ends up exposing bad bits on a disk), but maybe not. > I would recommend also using "logger" to log something in /var/log/messages. Yeah, that makes sense. logger is part of util-linux{,-ng}, so that's not a huge extra dependency either. >> echo "Don't know how to set the last-check time on $fstype..." >&2 > > These error messages are incorrect, namely "set the last-check time" should > be replaced with "force a check". That's true. I was trying to get the errors to refer to what specific information needed to be added to the script (in this case, it needs to know how to set the last-check time), but "force a check" is probably safer anyway. Setting the last-check time may not be the method that every FS uses. > Since there isn't any reason to special > case reiserfs here, you may as well remove it. That's what I get for deciding to handle reiser separately everywhere, and then changing my mind later -- I forgot to go back and remove this case. Oops... :-) > I suspect that a nice email to the XFS and JFS folks would get them to add > some mechanism to force a filesystem check on the next reboot. Is the issue that those FSes don't have any such mechanism today, or is it just that I don't know how to do this on them? (I'll have to go look up the XFS/JFS lists, too, but that's not terribly difficult.) >> nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" && >> nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev" > > Hmm, I'm not sure I understand what it is you want to do? 
Well, neither do I, necessarily -- those arguments were copied from the
initial script that I hacked the extra stuff into (the one that Ted
posted at the start of this whole thing). :-)

I see that your script just uses -fn; that's probably simpler anyway.
What it doesn't determine is whether fsck would be able to automatically
repair the damage that it finds; I guess the question is whether this
condition should be treated as a fsck failure (requiring a reboot to
fix) or not. It probably depends on the severity of the fixes that fsck
makes...

OTOH, if you give e2fsck the -fy option, and it does make changes, its
exit status will not be zero, so it will already be treated as a failure
by this script. So the only difference is that -fn stops it from writing
to the snapshot just to have the writes thrown away; that's probably
actually good.

> and "-p" without "-f" will just check the superblock.

Yeah, I think the idea was to check the superblock first, and then check
the rest of the FS. But I think -fn is probably more explicit about what
we want fsck to do, too. (Plus, even if we do take a read-write snapshot
with LVM2, there's no point in taking up extra space by writing to the
snapshot itself, if it's just going to get thrown away.)

> For the log file it probably makes sense to keep this around with a
> timestamp if there is a failure.

And let e.g. logrotate get rid of older versions; yeah, that makes sense.

> To find free space, use "vgs -o vg_size --noheadings ${vg}", and the
> LV size can be had from "lvs -o lv_size --noheadings ${vg}/${lv}".

Free space can also be retrieved with -o vg_free, but yeah.

> You can strip the size suffixes with "--units M --nosuffix" to get
> units of MB.

Ah, that was the bit I was missing yesterday (further down in the
script): --nosuffix. Thanks! I also just got your message from yesterday
about the reasoning behind the snapshot-size guess (based on the
frequency of writes to the main LV); that makes sense.
And since I can get the size out of lvs, that makes that much easier, too, so I'll just use 1/500th the LV size. > Also good to create a more unique name than "${lv}-snap", since that > might conflict with an existing snapshot, and if the script crashes > the user might be wondering if that LV using 100% of the free space is > safe to delete or not. Yeah, that was left over from the original script as well. Changing it makes sense. > Please also add XFS support here, Done, I think. I assume xfs_check doesn't need any args? (Should fsck.xfs perhaps just exec xfs_check and pass it all the args? That's a whole separate discussion, probably.) > For JFS it can also use "fsck.jfs -fn $dev" to check the filesystem. Done. >> echo 'Background scrubbing succeeded!' >> echo 'Background scrubbing failed! Reboot to fsck soon!' > > Printing the device name in these messages, and sending them to the syslog > via logger would probably be more useful. True; done. The severity may need a bit of tweaking, but hopefully not much. >> set -e > > Have you verified that the script doesn't exit if an fsck fails with an > error? No, the script exits if fsck fails with an error. That's obviously bad - -- I wasn't thinking that far ahead when I added that. It's gone now. >> . /etc/lvcheck.conf > > You should check that this file exists before sourcing it, or the script will > exit with an error That was intended; I figured the config file would be required (back when I first added it). But since we have decent default values for the settings in it, it probably makes sense to make it optional now. >> FSTYPE="`/lib/udev/vol_id -t "$DEV"`" > > Please use "blkid", since that is part of e2fsprogs already and avoids > an extra dependency. True. Looking at the manpages, it appears that vol_id does some extra checks to try to detect RAID members as RAID members, instead of partitions containing a filesystem. 
But that would only affect this script if someone had multiple LVs RAIDed together, and I doubt that's well-supported elsewhere, so blkid is fine. >> # if the date is unknown, run fsck every day. sigh. > > Better to write "run fsck each time the script is run". Yeah, that makes more sense. >> # ??? -- can lvs print vg_free in plain numbers, or do I have to >> # figure out what a suffix of "m" means? skip the check for now. > > "vgs", and --nosuffix, per above. Yep, done. >> EMAIL='root' >> INTERVAL=30 >> AC_UNKNOWN="ABORT" > > I would also make these all be defaults in the script (before this file is > parsed), so it works as expected if /etc/lvscan.conf doesn't exist. Since it's now optional, yes, that makes sense. > I'd also recommend that the default for AC_UNKNOWN be CONTINUE (or possibly > leave it unset by default and have the script not error out in this case, > so that the script does something useful for the majority of users. Well, it depends on whether the majority of users have laptops, or some other hardware type (desktops, servers, etc.). I was thinking that laptops would be more prevalent, but since this is Linux, it's probably actually servers. OK -- CONTINUE it is, by default. > If we are worried about the laptop case, we could add checks to see > if the system has a PC card, since very few desktop systems have them. > Both the commands "pccardctl info" and "cardctl info" produce no output > on stdout if there is no PC card slot, and this could be used to decide > between "CONTINUE" for desktops and "ABORT" for laptops. Or stuff it into comments in the config file. Pushing the decision back onto the user makes me a bit uncomfortable, but fuzzy decisions (ones that aren't necessarily based on the right info) make me even less comfortable. Hmm. And depending how the power_supply sysfs class ends up working, maybe this is all a moot point anyway: if it always has devices under it on >=2.6.24, then the setting won't even matter. 
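The sysfs-based AC detection being discussed might look like the sketch below. The helper name `on_ac` and the sysfs-root parameter are assumptions for illustration (the parameter exists purely so the probe can be exercised against a mock directory tree; a real script would default it to /sys/class/power_supply):

```shell
#!/bin/sh
# Sketch of the /sys/class/power_supply probe discussed above.  Returns 0
# if any non-battery supply reports online, 1 if all known supplies are
# offline, and 0 ("assume AC", matching the CONTINUE default) when the
# state can't be determined at all.
on_ac() {
    root="${1:-/sys/class/power_supply}"
    any_known=no
    [ -d "$root" ] || return 0
    for psu in "$root"/*; do
        [ -r "$psu/type" ] || continue
        # batteries don't tell us whether mains power is present
        [ "$(cat "$psu/type")" = "Battery" ] && continue
        online=$(cat "$psu/online" 2>/dev/null)
        [ "$online" = "1" ] && return 0
        [ "$online" = "0" ] && any_known=yes
    done
    [ "$any_known" = "yes" ] && return 1
    return 0
}
```

If the kernel always populates this class on >=2.6.24, as speculated above, the AC_UNKNOWN fallback would indeed rarely trigger.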
For now, I'll just leave the default CONTINUE, but with comments in the config file aimed at laptop users. - ---- Create a script to transparently run fsck in the background on any active LVM logical volumes, as long as the machine is on AC power, and that LV has been last checked more than a configurable number of days ago. Also create an optional configuration file to set various options in the script. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHmpTOS5vET1Wea5wRA2XXAKCZzt9SEOSBVs4EkrI4gt3Ztl0v5wCg3gq5 1ChmnEccT+hFVo/2B/RpU8U= =D4HV -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck.conf URL: From tytso at mit.edu Sat Jan 26 04:33:34 2008 From: tytso at mit.edu (Theodore Tso) Date: Fri, 25 Jan 2008 23:33:34 -0500 Subject: forced fsck (again?) In-Reply-To: <20080125085557.GV18433@webber.adilger.int> References: <20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> Message-ID: <20080126043334.GB28889@mit.edu> On Fri, Jan 25, 2008 at 01:55:57AM -0700, Andreas Dilger wrote: > > nice logsave -as "${tmpfile}" fsck.${fstype} -p -C 0 "$dev" && > > nice logsave -as "${tmpfile}" fsck.${fstype} -fy -C 0 "$dev" > > Hmm, I'm not sure I understand what it is you want to do? The fsck should > be run as 'e2fsck -fn "$dev"' (since we already know this is ext2|ext3). 
> Using "-C 0" isn't useful because we don't want progress in the output log, This was my fault. It means that when you run this from a tty, you get to see the progress bar. The -s flag to logsave will strip out the progress information. (I added logsave -s precisely for this purpose. :-) > and "-p" without "-f" will just check the superblock. That's needed; e2fsck -p will clean up the orphaned inode list, so that the subsequent e2fsck -fy will return 0 if the filesystem is clean. Without the e2fsck -p, e2fsck -fy will return 1 (because it modified the filesystem), which we can't distinguish from the case where the filesystem had errors. > We don't want to be > fixing anything (since this should be a read-only snapshot) so "-fy" is > also not so great. This is a tradeoff. e2fsck -fy requires that the snapshot have more space (although if you run out, it's not that horrible; the snapshot will just go invalid). The advantage of "-fy" is that you get more information about any errors in the filesystem, whereas "-fn" may not report as useful information. > > # do everything needed to check and reset dates and counters on /dev/$1/$2. > > function check_fs() { > > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > > trap "rm $tmpfile ; trap - RETURN" RETURN > > For the log file it probably makes sense to keep this around with a > timestamp if there is a failure. That means it is fine to generate a > random filename temporarily, but it should be renamed to something > meaningful (e.g. /var/log/lvfsck.$dev.$(date +%Y%m%d) or similar). The idea is if there is a failure we'll e-mail to the administrator; after that, there's no real need to keep it around. - Ted From bryan at kadzban.is-a-geek.net Tue Jan 29 00:56:50 2008 From: bryan at kadzban.is-a-geek.net (Bryan Kadzban) Date: Mon, 28 Jan 2008 19:56:50 -0500 Subject: forced fsck (again?)
In-Reply-To: <20080128174804.GT18433@webber.adilger.int> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> <20080128174804.GT18433@webber.adilger.int> Message-ID: <479E79D2.5070406@kadzban.is-a-geek.net> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 Andreas Dilger wrote: > On Jan 25, 2008 21:02 -0500, Bryan Kadzban wrote: >> logger $arg -p user."$sev" -- "$msg" > > This should use "-t lvcheck" so that it reports what program is generating > the message. Yep, that'd be useful. >> tune2fs -C 16000 -T "19000101" "$dev" > > I'm a tiny bit reluctant to overwrite the "last checked" date, since this > might be useful information for the administrator (i.e. it will tell the > interval wherein the corruption was detected). Setting the "mount count" > is enough to force a check, and the mount count itself can be reverse > engineered from "reboot" messages in the "last" log. Assuming the user doesn't set a maximum mount count higher than 16000 (but I think that's highly unlikely). I think the benefit of being able to know (approximately) when corruption started is probably worth it, though. > It is a lot clearer if the "cases" (ext2|ext3|ext4) are aligned with the > "case" statement, I see what you mean. The script just uses vim's default autoindent levels, but I can change the cases. >> reiserfs) >> # do nothing? > > I thought you were going to remove the empty reiserfs cases? Er, I was; I think I was looking at the wrong case last time around. This one's gone now as well. >> local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > > Shouldn't be "e2fsck.log"? Maybe "lvcheck.log.XXXXXXXXX"? 
Yeah, that'd be better; that's more leftover code from the original script. >> # Assume the script won't run more than one instance at a time? >> lvremove -f "${lvtemp##/dev}" > > Should check the error return and bail out of script if there is an error. Will that catch the "more than one instance at a time" case (e.g. if another script run is still running e2fsck on this snapshot)? Assuming lvremove can fail (and it probably can), it's probably a good idea to check it in any case, but if running e2fsck makes lvremove fail (until e2fsck finishes), that's a decent way to get rid of the comment too. Also, I think it'd be better to skip just the current FS, rather than an "exit 1" type bail-out, right? > MINFREE=0 # megabytes to leave free in each volume group > MINSNAP=256 # megabytes for minimum snapshot size. I've added something very similar to this logic, but I changed the checks around a bit. I think it makes more sense this way (doing the overall space check first, and then the limits second), unless this logic disallows some valid combinations? (Still trying to decide how to handle logging *fsck output, and what to do with the file, based on your other message...) - ----- Create a script to transparently run fsck in the background on any active LVM logical volumes, as long as the machine is on AC power, and that LV has been last checked more than a configurable number of days ago. Also create an optional configuration file to set various options in the script. Signed-Off-By: Bryan Kadzban -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHnnnRS5vET1Wea5wRAw0iAJ9wcLyfBSaH5FSIJNH0YakzDCUvjwCgnJEH lPScP39vBYIIjOQPiftgDs8= =XjFF -----END PGP SIGNATURE----- -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: lvcheck URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: lvcheck.conf URL: From sandeen at redhat.com Tue Jan 29 02:42:11 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 28 Jan 2008 20:42:11 -0600 Subject: forced fsck (again?) In-Reply-To: <479E79D2.5070406@kadzban.is-a-geek.net> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> <20080128174804.GT18433@webber.adilger.int> <479E79D2.5070406@kadzban.is-a-geek.net> Message-ID: <479E9283.5000001@redhat.com> Some hints for xfs, which does not enforce check intervals, so: - no mechanism or need to delay next check - no mechanism to enforce check on next boot; just notify w/ email - no mechanism to read last-checked; just check on acceptable cron interval Also, you really want to use xfs_repair -n instead of xfs_check; it's much faster and more memory-efficient. So most of the xfs) cases are just documenting that xfs can't and/or doesn't need to do anything, they don't really need to be there - up to you. :) -Eric --- lvcheck.orig 2008-01-28 20:23:16.000000000 -0600 +++ lvcheck 2008-01-28 20:40:25.000000000 -0600 @@ -111,6 +111,9 @@ ext2|ext3) tune2fs -C 16000 "$dev" ;; + xfs) + # XFS does not enforce check intervals; let email suffice. + ;; *) log "warning" "Don't know how to force a check on $fstype..." ;; @@ -126,6 +129,9 @@ ext2|ext3) tune2fs -C 0 -T now "$dev" ;; + xfs) + # XFS does not enforce check intervals; nothing to delay + ;; *) log "warning" "Don't know how to delay checks on $fstype..." 
;; @@ -143,6 +149,10 @@ dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \ sed -e 's/Last checked:[[:space:]]*//' ;; + xfs) + # XFS does not save last-checked; just check on cron interval + echo "Unknown" + ;; *) # TODO: add support for various FSes here echo "Unknown" @@ -167,7 +177,7 @@ return 0 ;; xfs) - nice logsave -as "${tmpfile}" xfs_check "$dev" + nice logsave -as "${tmpfile}" xfs_repair -n "$dev" return $? ;; jfs) From sandeen at redhat.com Tue Jan 29 03:39:26 2008 From: sandeen at redhat.com (Eric Sandeen) Date: Mon, 28 Jan 2008 21:39:26 -0600 Subject: forced fsck (again?) In-Reply-To: <479749A1.5040208@redhat.com> References: <200801221701.50202.giancarlo.corti@supsi.ch> <4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com> <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com> <20080123091601.GZ3180@webber.adilger.int> <479749A1.5040208@redhat.com> Message-ID: <479E9FEE.4020506@redhat.com> Eric Sandeen wrote: > Andreas Dilger wrote: > >> Maybe some of the distro folks (Eric? :-) will pick up on this thread and >> consider adding the "e2fsck snapshot" script to cron.monthly or similar. > > I'm watching.... sure, that might be a candidate for Fedora. Ideally > it'd be part of e2fsprogs Er, I guess it really doesn't need to be in e2fsprogs, does it, since it's extending to cover other fs's; it could stand on its own, or maybe even be part of the init infrastructure. I'll ask the folks who own init; otherwise we could package it up on its own. -Eric From adilger at sun.com Mon Jan 28 17:48:04 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 28 Jan 2008 10:48:04 -0700 Subject: forced fsck (again?) 
In-Reply-To: <479A94D0.9080308@kadzban.is-a-geek.net> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> Message-ID: <20080128174804.GT18433@webber.adilger.int> On Jan 25, 2008 21:02 -0500, Bryan Kadzban wrote: > > I suspect that a nice email to the XFS and JFS folks would get them to add > > some mechanism to force a filesystem check on the next reboot. > > Is the issue that those FSes don't have any such mechanism today, or is > it just that I don't know how to do this on them? I don't think they have any such mechanism (at least not one that I know about), but I think they will find it useful to add. > (Should fsck.xfs perhaps just exec xfs_check and pass it all the args? > That's a whole separate discussion, probably.) Right... > Create a script to transparently run fsck in the background on any > active LVM logical volumes, as long as the machine is on AC power, and > that LV has been last checked more than a configurable number of days > ago. Also create an optional configuration file to set various options > in the script. > > Signed-Off-By: Bryan Kadzban > #!/bin/sh > # > # lvcheck > > # send $2 to syslog, with severity $1 > # severities are emerg/alert/crit/err/warning/notice/info/debug > function log() { > local sev="$1" > local msg="$2" > local arg= > > # log warning-or-higher messages to stderr as well > [ "$sev" == "emerg" || "$sev" == "alert" || "$sev" == "crit" || \ > "$sev" == "err" || "$sev" == "warning" ] && arg=-s > > logger $arg -p user."$sev" -- "$msg" > } This should use "-t lvcheck" so that it reports what program is generating the message. 
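A sketch of the log() helper with the suggested "-t lvcheck" tag. One caveat: the posted script's severity test, `[ "$sev" == "emerg" || ... ]`, is not valid test(1) syntax (`||` cannot appear inside single brackets); a case statement is a portable fix. The sev_arg helper below is a hypothetical name, split out only so the severity mapping can be checked on its own:

```shell
#!/bin/sh
# Map a syslog severity to the extra logger flag: warning-or-higher
# messages are also mirrored to stderr via -s.
sev_arg() {
    case "$1" in
        emerg|alert|crit|err|warning) echo "-s" ;;
        *) echo "" ;;
    esac
}

# send $2 to syslog with severity $1; -t tags each entry with the
# generating program's name, as suggested above
log() {
    logger -t lvcheck $(sev_arg "$1") -p "user.$1" -- "$2"
}
```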
> # attempt to force a check of $1 on the next reboot > function try_force_check() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 16000 -T "19000101" "$dev" I'm a tiny bit reluctant to overwrite the "last checked" date, since this might be useful information for the administrator (i.e. it will tell the interval wherein the corruption was detected). Setting the "mount count" is enough to force a check, and the mount count itself can be reverse engineered from "reboot" messages in the "last" log. > # attempt to set the last-check time on $1 to now, and the mount count to 0. > function try_delay_checks() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) It is a lot clearer if the "cases" (ext2|ext3|ext4) are aligned with the "case" statement, like below, since that provides a better separation: case "$fstype" in ext2|ext3|ext4) tune2fs -C 0 -T now "$dev" ;; > reiserfs) > # do nothing? ;; I thought you were going to remove the empty reiserfs cases? > # check the FS on $1 passively, saving output to $3. > function perform_check() { > local dev="$1" > local fstype="$2" > local tmpfile="$3" > > case "$fstype" in > ext2|ext3) Ditto on indenting the cases. > # do everything needed to check and reset dates and counters on /dev/$1/$2. > function check_fs() { > local vg="$1" > local lv="$2" > local fstype="$3" > local snapsize="$4" > > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` Shouldn't be "e2fsck.log"? Maybe "lvcheck.log.XXXXXXXXX"? > local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`" > local snaplvbase="${lv}-lvcheck-temp" > local snaplv="${snaplvbase}-`date +'%Y%m%d'`" > > # clean up any left-over snapshot LVs > for lvtemp in /dev/${vg}/${snaplvbase}* ; do > if [ -e "$lvtemp" ] ; then > # Assume the script won't run more than one instance at a time? > lvremove -f "${lvtemp##/dev}" Should check the error return and bail out of script if there is an error. 
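The requested error check on that cleanup loop might look like this sketch (hypothetical helper name; the full /dev/<vg>/<base> prefix is passed in as a parameter). A failed lvremove -- e.g. a snapshot still held open by a running fsck -- makes the function return nonzero, so the caller can skip just this LV rather than exiting the whole script:

```shell
#!/bin/sh
# Sketch of the stale-snapshot cleanup with an error check added.  On any
# lvremove failure we log to stderr and return 1 so the caller can skip
# this LV; left-over snapshots then double as a crude lock against a
# second concurrent check of the same LV.
cleanup_stale() {
    prefix="$1"
    for lvtemp in "$prefix"*; do
        [ -e "$lvtemp" ] || continue
        if ! lvremove -f "$lvtemp"; then
            echo "Could not delete stale snapshot $lvtemp" >&2
            return 1
        fi
    done
    return 0
}

# intended usage inside the per-LV loop:
#   cleanup_stale "/dev/${vg}/${lv}-lvcheck-temp" || continue
```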
> # parse up lvscan output > lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \ > while read DEV ; do > > if [ "$SNAPSIZE" -gt "$SPACE" ] ; then > log "err" "Can't take a snapshot of $DEV: not enough free space in the VG." > continue Well, the 1/500 rule is only a guideline. For example, I have a huge filesystem for TV shows, but it doesn't change that often, so it would make more sense to just reduce $SNAPSIZE to $SPACE (assuming some minimum amount of free space is available). Make a default, that is settable in the .conf file: MINFREE=0 # megabytes to leave free in each volume group MINSNAP=256 # megabytes for minimum snapshot size. # make snapshot large enough to handle e.g. journal and other updates [ $SNAPSIZE -lt $MINSNAP ] && SNAPSIZE=$MINSNAP # limit snapshot to available space [ $SNAPSIZE -gt $((SPACE - MINFREE)) ] && SNAPSIZE=$((SPACE - MINFREE)) # if we don't have enough space, skip this check if [ $SNAPSIZE -lt $MINSNAP ]; then log "warning" "Check of $LV can't get ${SNAPSIZE}MB, skipping" continue fi Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Mon Jan 28 20:59:19 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 28 Jan 2008 13:59:19 -0700 Subject: Integrating patches in SLES10 e2fsprogs In-Reply-To: <479E08D5.3040609@redhat.com> References: <20080124211728.GA24900@webber.adilger.int> <20080127050543.GC24842@mit.edu> <20080128153802.GB17752@mit.edu> <479E08D5.3040609@redhat.com> Message-ID: <20080128205919.GW18433@webber.adilger.int> On Jan 28, 2008 10:54 -0600, Eric Sandeen wrote: > Theodore Tso wrote: > > On Mon, Jan 28, 2008 at 04:26:53PM +0100, Matthias Koenig wrote: > >>> Patch6: e2fsprogs-mdraid.patch > >>> > >>> This apparently adds a new environment variable, > >>> BLKID_SKIP_CHECK_MDRAID, which forces blkid to not detect mdraid > >>> devices. I'm not sure why. 
> >> Workaround for people having stale RAID signature on their disk: > >> https://bugzilla.novell.com/show_bug.cgi?id=100530 > > > > Hmm... there's got to be a better way around this. > > Won't help existing block devices, but it'd be nice to have a common > library which could be called @ mkfs time to wipe out all known > signatures... > > mkfs.xfs tries to do this, but it'd be silly to duplicate in every mkfs. Well, blkid already has a way to _detect_ a lot of filesystem signatures, so it might be relatively easy to exploit the type_array[] entries to have it zap out all of these blocks. That said, the majority of them are in the first 68kB of the filesystem (mdraid excluded) so it shouldn't be too hard to zero them out. Let's hope nobody ever uses "0x00000000" as magic. mke2fs already tries to do this, though I notice: - the zap_sector() call will skip the entire write if there is a BSD bootblock, instead of skipping only the first sector(s) and overwriting the rest... Since I don't know much about BSD bootblocks, I don't know what the right behaviour is, but I can guess we still want to zero out 4-68kB (or whatever). - it only overwrites up to sector 8 (4kB) and not further into the disk to catch e.g. reiserfs superblocks. Usually it will overwrite this anyways (GDT, bitmaps, inode table), but in some rare cases it might not. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Mon Jan 28 17:52:16 2008 From: adilger at sun.com (Andreas Dilger) Date: Mon, 28 Jan 2008 10:52:16 -0700 Subject: forced fsck (again?) 
In-Reply-To: <20080126043334.GB28889@mit.edu> References: <4796B60F.4040009@kadzban.is-a-geek.net> <20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <20080126043334.GB28889@mit.edu> Message-ID: <20080128175216.GU18433@webber.adilger.int> On Jan 25, 2008 23:33 -0500, Theodore Tso wrote: > > Hmm, I'm not sure I understand what it is you want to do? The fsck should > > be run as 'e2fsck -fn "$dev"' (since we already know this is ext2|ext3). > > Using "-C 0" isn't useful because we don't want progress in the output log, > > This was my fault. It means that when you run this from a tty, you > get to see the progress bar. The -s flag to logsave will strip out > the progress information. (I added logsave -s precisely for this > purpose. :-) OK, that is fine too, I wasn't sure if it would fill the log with "===". > > and "-p" without "-f" will just check the superblock. > > That's needed e2fsck -p will clean up the orphaned inode list, so that > the subsequent e2fsck -fy will return 0 if the filesystem is clean. > Without the the fsck -p, then e2fsck -fy will return 1 (because it > modified the filesystem) which we can't distinguish from the case > where the filesystem had errors. Hmm, shouldn't that be cleaned up when making a snapshot? If not, then we are stuck with the problem that you have to have writable snapshots, and that is less desirable than read-only snapshots, but not fatal I guess. > > We don't want to be fixing anything (since this should be a read-only > > snapshot) so "-fy" is also not so great. > > This is a tradeoff. e2fsck -fy requires that the snapshot have more > space (although if you run off, it's not that horrible; the snapshot > will just go invalid). 
Well, in my one experiment this caused the lvcheck to be unkillable, and also marked the parent offline... Maybe it was just that one time (I haven't tested extensively). > The advantage of "-fy" is that you get more > information about any errors in the filesystem, where as "-fn" may not > report as useful information. True. > > > # do everything needed to check and reset dates and counters on /dev/$1/$2. > > > function check_fs() { > > > local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX` > > > trap "rm $tmpfile ; trap - RETURN" RETURN > > > > For the log file it probably makes sense to keep this around with a > > timestamp if there is a failure. That means it is fine to generate a > > random filename temporarily, but it should be renamed to something > > meaningful (e.g. /var/log/lvfsck.$dev.$(date +%Y%m%d) or similar). > > The idea is if there is a failure we'll e-mail to the administrator; > after that, there's no real need to keep it around. Unless email is broken, for whatever reason. I suppose it might make sense to keep a single log for each device (put the timestamp inside the log) so that the space usage doesn't increase dramatically. Having logrotate do cleanup isn't so great. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. From adilger at sun.com Tue Jan 29 23:56:27 2008 From: adilger at sun.com (Andreas Dilger) Date: Tue, 29 Jan 2008 16:56:27 -0700 Subject: forced fsck (again?) 
In-Reply-To: <479E79D2.5070406@kadzban.is-a-geek.net> References: <20080123140847.GB29321@mit.edu> <20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net> <20080124043930.GG18433@webber.adilger.int> <47988245.4010904@kadzban.is-a-geek.net> <47995564.2050402@kadzban.is-a-geek.net> <20080125085557.GV18433@webber.adilger.int> <479A94D0.9080308@kadzban.is-a-geek.net> <20080128174804.GT18433@webber.adilger.int> <479E79D2.5070406@kadzban.is-a-geek.net> Message-ID: <20080129235627.GB23836@webber.adilger.int> On Jan 28, 2008 19:56 -0500, Bryan Kadzban wrote: > >> # Assume the script won't run more than one instance at a time? > >> lvremove -f "${lvtemp##/dev}" > > > > Should check the error return and bail out of script if there is an error. > > Will that catch the "more than one instance at a time" case (e.g. if > another script run is still running e2fsck on this snapshot)? Assuming > lvremove can fail (and it probably can), it's probably a good idea to > check it in any case, but if running e2fsck makes lvremove fail (until > e2fsck finishes), that's a decent way to get rid of the comment too. > > Also, I think it'd be better to skip just the current FS, rather than an > "exit 1" type bail-out, right? It's a hard call... In some sense if there is an error we may leave a string of LVs around that are filling up the VG, but the presence of the LV (and hopefully being unable to remove it while e2fsck is running) also serves as a "locking" mechanism in case some e2fsck takes a very long time to run. I guess as long as we print something in the syslog, and the LV remains in place with a suitably clear "this isn't very useful" name, then eventually the user will notice it and delete it. > - ----- > > Create a script to transparently run fsck in the background on any > active LVM logical volumes, as long as the machine is on AC power, and > that LV has been last checked more than a configurable number of days > ago. 
Also create an optional configuration file to set various options > in the script. > > Signed-Off-By: Bryan Kadzban You can add a Signed-Off-By: Andreas Dilger here, as it does everything I think is needed at this point... Probably good to put a version number in the script, along with your name/email so it is clear what version a user is running. > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.7 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHnnnRS5vET1Wea5wRAw0iAJ9wcLyfBSaH5FSIJNH0YakzDCUvjwCgnJEH > lPScP39vBYIIjOQPiftgDs8= > =XjFF > -----END PGP SIGNATURE----- > #!/bin/sh > # > # lvcheck > > # Released under the GNU General Public License, either version 2 or > # (at your option) any later version. > > # Overview: > # > # Run this from cron periodically (e.g. once per week). If the > # machine is on AC power, it will run the checks; otherwise they will > # all be skipped. (If the script can't tell whether the machine is > # on AC power, it will use a setting in the configuration file > # (/etc/lvcheck.conf) to decide whether to continue with the checks, > # or abort.) > # > # The script will then decide which logical volumes are active, and > # can therefore be checked via an LVM snapshot. Each of these LVs > # will be queried to find its last-check day, and if that was more > # than $INTERVAL days ago (where INTERVAL is set in the configuration > # file as well), or if the last-check day can't be determined, then > # the script will take an LVM snapshot of that LV and run fsck on the > # snapshot. The snapshot will be set to use 1/500 the space of the > # source LV. After fsck finishes, the snapshot is destroyed. > # (Snapshots are checked serially.) > # > # Any LV that passes fsck should have its last-check time updated (in > # the real superblock, not the snapshot's superblock); any LV whose > # fsck fails will send an email notification to a configurable user > # ($EMAIL). 
This $EMAIL setting is optional, but its use is highly > # recommended, since if any LV fails, it will need to be checked > # manually, offline. Relevant messages are also sent to syslog. > > # Set default values for configuration params. Changes to these values > # will be overwritten on an upgrade! To change these values, use > # /etc/lvcheck.conf. > EMAIL='root' > INTERVAL=30 > AC_UNKNOWN="CONTINUE" > MINSNAP=256 > MINFREE=0 > > # send $2 to syslog, with severity $1 > # severities are emerg/alert/crit/err/warning/notice/info/debug > function log() { > local sev="$1" > local msg="$2" > local arg= > > # log warning-or-higher messages to stderr as well > [ "$sev" == "emerg" || "$sev" == "alert" || "$sev" == "crit" || \ > "$sev" == "err" || "$sev" == "warning" ] && arg=-s > > logger -t lvcheck $arg -p user."$sev" -- "$msg" > } > > # determine whether the machine is on AC power > function on_ac_power() { > local any_known=no > > # try sysfs power class first > if [ -d /sys/class/power_supply ] ; then > for psu in /sys/class/power_supply/* ; do > if [ -r "${psu}/type" ] ; then > type="`cat "${psu}/type"`" > > # ignore batteries > [ "${type}" = "Battery" ] && continue > > online="`cat "${psu}/online"`" > > [ "${online}" = 1 ] && return 0 > [ "${online}" = 0 ] && any_known=yes > fi > done > > [ "${any_known}" = "yes" ] && return 1 > fi > > # else fall back to AC adapters in /proc > if [ -d /proc/acpi/ac_adapter ] ; then > for ac in /proc/acpi/ac_adapter/* ; do > if [ -r "${ac}/state" ] ; then > grep -q on-line "${ac}/state" && return 0 > grep -q off-line "${ac}/state" && any_known=yes > elif [ -r "${ac}/status" ] ; then > grep -q on-line "${ac}/status" && return 0 > grep -q off-line "${ac}/status" && any_known=yes > fi > done > > [ "${any_known}" = "yes" ] && return 1 > fi > > if [ "$AC_UNKNOWN" == "CONTINUE" ] ; then > return 0 # assume on AC power > elif [ "$AC_UNKNOWN" == "ABORT" ] ; then > return 1 # assume on battery > else > log "err" "Invalid value for AC_UNKNOWN 
in the config file" > exit 1 > fi > } > > # attempt to force a check of $1 on the next reboot > function try_force_check() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 16000 "$dev" > ;; > *) > log "warning" "Don't know how to force a check on $fstype..." > ;; > esac > } > > # attempt to set the last-check time on $1 to now, and the mount count to 0. > function try_delay_checks() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > tune2fs -C 0 -T now "$dev" > ;; > *) > log "warning" "Don't know how to delay checks on $fstype..." > ;; > esac > } > > # print the date that $1 was last checked, in a format that date(1) will > # accept, or "Unknown" if we don't know how to find that date. > function try_get_check_date() { > local dev="$1" > local fstype="$2" > > case "$fstype" in > ext2|ext3) > dumpe2fs -h "$dev" 2>/dev/null | grep 'Last checked:' | \ > sed -e 's/Last checked:[[:space:]]*//' > ;; > *) > # TODO: add support for various FSes here > echo "Unknown" > ;; > esac > } > > # check the FS on $1 passively, saving output to $3. > function perform_check() { > local dev="$1" > local fstype="$2" > local tmpfile="$3" > > case "$fstype" in > ext2|ext3) > nice logsave -as "${tmpfile}" e2fsck -fn "$dev" > return $? > ;; > reiserfs) > echo Yes | nice logsave -as "${tmpfile}" fsck.reiserfs --check "$dev" > # apparently can't fail? let's hope not... > return 0 > ;; > xfs) > nice logsave -as "${tmpfile}" xfs_check "$dev" > return $? > ;; > jfs) > nice logsave -as "${tmpfile}" fsck.jfs -fn "$dev" > return $? > ;; > *) > log "warning" "Don't know how to check $fstype filesystems passively: assuming OK." > ;; > esac > } > > # do everything needed to check and reset dates and counters on /dev/$1/$2. 
> function check_fs() {
>     local vg="$1"
>     local lv="$2"
>     local fstype="$3"
>     local snapsize="$4"
>
>     local tmpfile=`mktemp -t lvcheck.log.XXXXXXXXXX`
>     local errlog="/var/log/lvcheck-${vg}@${lv}-`date +'%Y%m%d'`"
>     local snaplvbase="${lv}-lvcheck-temp"
>     local snaplv="${snaplvbase}-`date +'%Y%m%d'`"
>
>     # clean up any left-over snapshot LVs
>     for lvtemp in /dev/${vg}/${snaplvbase}* ; do
>         if [ -e "$lvtemp" ] ; then
>             # Assume the script won't run more than one instance at a time?
>
>             log "warning" "Found stale snapshot $lvtemp: attempting to remove."
>
>             if ! lvremove -f "${lvtemp##/dev}" ; then
>                 log "err" "Could not delete stale snapshot $lvtemp"
>                 return 1
>             fi
>         fi
>     done
>
>     # and create this one (snapsize is in megabytes)
>     lvcreate -s -L "${snapsize}M" -n "${snaplv}" "${vg}/${lv}"
>
>     if perform_check "/dev/${vg}/${snaplv}" "${fstype}" "${tmpfile}" ; then
>         log "info" "Background scrubbing of /dev/${vg}/${lv} succeeded."
>         try_delay_checks "/dev/${vg}/${lv}" "$fstype"
>     else
>         log "err" "Background scrubbing of /dev/${vg}/${lv} failed: run fsck offline soon!"
>         try_force_check "/dev/${vg}/${lv}" "$fstype"
>
>         if test -n "$EMAIL"; then
>             mail -s "Fsck of /dev/${vg}/${lv} failed!" $EMAIL < $tmpfile
>         fi
>
>         # save the log file in /var/log in case mail is disabled
>         mv "$tmpfile" "$errlog"
>     fi
>
>     rm -f "$tmpfile"
>     lvremove -f "${vg}/${snaplv}"
> }
>
> # pull in configuration -- overwrite the defaults above if the file exists
> [ -r /etc/lvcheck.conf ] && . /etc/lvcheck.conf
>
> # check whether the machine is on AC power: if not, skip fsck
> on_ac_power || exit 0
>
> # parse up lvscan output
> lvscan 2>&1 | grep ACTIVE | awk '{print $2;}' | \
> while read DEV ; do
>     # remove the single quotes around the device name
>     DEV="`echo "$DEV" | tr -d \'`"
>
>     # get the FS type: blkid prints TYPE="blah"
>     eval `blkid -s TYPE "$DEV" | cut -d' ' -f2`
>
>     # get the last-check time
>     check_date=`try_get_check_date "$DEV" "$TYPE"`
>
>     # if the date is unknown, run fsck every time the script runs.  sigh.
>     if [ "$check_date" != "Unknown" ] ; then
>         # add $INTERVAL days, and throw away the time portion
>         check_day=`date --date="$check_date $INTERVAL days" +'%Y%m%d'`
>
>         # get today's date, and skip the check if it's not within the interval
>         today=`date +'%Y%m%d'`
>         [ $check_day -gt $today ] && continue
>     fi
>
>     # get the volume group and logical volume names
>     VG="`lvs --noheadings -o vg_name "$DEV"`"
>     LV="`lvs --noheadings -o lv_name "$DEV"`"
>
>     # get the free space and LV size (in megs), guess at the snapshot
>     # size, and see how much the admin will let us use (keeping MINFREE
>     # available)
>     SPACE="`lvs --noheadings --units M --nosuffix -o vg_free "$DEV"`"
>     SIZE="`lvs --noheadings --units M --nosuffix -o lv_size "$DEV"`"
>     # drop any fractional part: expr can only do integer arithmetic
>     SPACE="${SPACE%%.*}"
>     SIZE="${SIZE%%.*}"
>     SNAPSIZE="`expr "$SIZE" / 500`"
>     AVAIL="`expr "$SPACE" - "$MINFREE"`"
>
>     # if we don't even have MINSNAP space available, skip the LV
>     if [ "$MINSNAP" -gt "$AVAIL" -o "$AVAIL" -le 0 ] ; then
>         log "warning" "Not enough free space on volume group for ${DEV}; skipping"
>         continue
>     fi
>
>     # make snapshot large enough to handle e.g. journal and other updates
>     [ "$SNAPSIZE" -lt "$MINSNAP" ] && SNAPSIZE="$MINSNAP"
>
>     # limit snapshot to available space (VG space minus min-free)
>     [ "$SNAPSIZE" -gt "$AVAIL" ] && SNAPSIZE="$AVAIL"
>
>     # don't need to check SNAPSIZE again: MINSNAP <= AVAIL, MINSNAP <= SNAPSIZE,
>     # and SNAPSIZE <= AVAIL, combined, means SNAPSIZE must be between MINSNAP
>     # and AVAIL, which is what we need -- assuming AVAIL > 0
>
>     # check it
>     check_fs "$VG" "$LV" "$TYPE" "$SNAPSIZE"
> done
>
> #!/bin/sh
>
> # e2check configuration file

Minor note - "lvcheck configuration file".

> # This file follows the pattern of sshd_config: default
> # values are shown here, commented-out.
>
> # EMAIL
> #    Address to send failure notifications to.  If empty,
> #    failure notifications will not be sent.
>
> #EMAIL='root'
>
> # INTERVAL
> #    Days to wait between checks.
> #    All LVs use the same
> #    INTERVAL, but the "days since last check" value can
> #    be different per LV, since that value is stored in
> #    the filesystem superblock.
>
> #INTERVAL=30
>
> # AC_UNKNOWN
> #    Whether to run the e2fsck checks if the script can't
> #    determine whether the machine is on AC power.  Laptop
> #    users will want to set this to ABORT, while server and
> #    desktop users will probably want to set this to
> #    CONTINUE.  Those are the only two valid values.
>
> #AC_UNKNOWN="CONTINUE"
>
> # MINSNAP
> #    Minimum snapshot size to take, in megabytes.  The
> #    default snapshot size is 1/500 the size of the logical
> #    volume, but if that size is less than MINSNAP, the
> #    script will use MINSNAP instead.  This should be large
> #    enough to handle e.g. journal updates, and other disk
> #    changes that require (semi-)constant space.
>
> #MINSNAP=256
>
> # MINFREE
> #    Minimum amount of space (in megabytes) to keep free in
> #    each volume group when creating snapshots.
>
> #MINFREE=0

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

From gregt at maths.otago.ac.nz  Wed Jan 30 02:01:58 2008
From: gregt at maths.otago.ac.nz (Greg Trounson)
Date: Wed, 30 Jan 2008 15:01:58 +1300
Subject: forced fsck (again?)
In-Reply-To: <70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
References: <200801221701.50202.giancarlo.corti@supsi.ch>
	<4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com>
	<70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
Message-ID: <479FDA96.5080209@maths.otago.ac.nz>

Valerie Henson wrote:
...
> This will be ironic coming from me, but I think the ext3 defaults for
> forcing a file system check are a little too conservative for many
> modern use cases.  The two cases I have in mind in particular are:
>
> * Servers with long uptimes that need very low data unavailability.
> Imagine you have a machine room full of servers that have all
> been up and running happily for more than 180 days - the preferred
> case.  Now imagine that the room overheats and the emergency power cut
> is tripped.  Standard heat reduction is swiftly applied (i.e., open
> the door and turn on a fan and hope security doesn't notice) and the
> power turned back on.  Now your entire machine room will be fscking
> for the next 3 hours and whatever service they provide will be
> completely unavailable.  Of course, any admin worth their salt will
> turn off force fsck so it only runs during controlled downtime...
> won't they?

Agreed.  This is a real problem.  And controlled downtime is rather
difficult if it takes several hours to complete.  You're either without
whatever services they provide or with reduced redundancy for that time.

> * Laptops.  If suspend and resume doesn't work on your laptop, you'll
> be rebooting (and remounting) a lot, perhaps several times a day.  The
> preferred solution is to get Matthew Garrett to fix your laptop, but
> if you can't, fscking every 10-30 days seems a little excessive.
> Desktop users who shutdown daily to save power will have similar
> problems.  Distros often have the "don't fsck on battery" option and
> some don't use the ext3 defaults for mkfs, but that's only a partial
> solution.  In this case, it's definitely a little much to ask a random
> laptop user to tune their file system.

Agreed again.  Having a laptop insist on an fsck when about to give a
presentation to a room full of professors is really not a good look.
And being flimsier and more abused than desktops, laptops IMO really do
need regular checking.

> I'm not sure what the best solution is ...

I am.  Since fscks are unacceptably inconvenient and apparently the only
thing worse than enforcing periodic fscks is *not* enforcing periodic
fscks, then we only have one option.  Make fscks less inconvenient.

And since we apparently can't make them any faster, the only way I can
think of to do that is to add support for (you know what I'm going to
say): Online fscks.

We really, *really* need to support checking of mounted read/write file
systems.  I would envisage a read-only fsck done on all mounted
filesystems regularly, which wouldn't do any damage to a file system if
implemented properly.  If an inconsistency is picked up, then recommend
an offline one to be scheduled when the user/admin is ready.

Greg

From chris.mason at oracle.com  Mon Jan 14 18:16:48 2008
From: chris.mason at oracle.com (Chris Mason)
Date: Mon, 14 Jan 2008 18:16:48 -0000
Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
In-Reply-To: <20080114170609.GH4214@duck.suse.cz>
References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu>
	<477BF72B.4000608@oracle.com> <20080114170609.GH4214@duck.suse.cz>
Message-ID: <20080114131454.37eb7c12@think.oraclecorp.com>

On Mon, 14 Jan 2008 18:06:09 +0100 Jan Kara wrote:

> On Wed 02-01-08 12:42:19, Zach Brown wrote:
> > Erez Zadok wrote:
> > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's
> > > latest tree.  Kernel w/ SMP, preemption, and lockdep configured.
> >
> > This is a real lock ordering problem.  Thanks for reporting it.
> >
> > The updating of atime inside sys_mmap() orders the mmap_sem in the
> > vfs outside of the journal handle in ext3's inode dirtying:
>
> [ lock inversion traces ]
>
> > Two fixes come to mind:
> >
> > 1) use something like Peter's ->mmap_prepare() to update atime
> > before acquiring the mmap_sem.
> > ( http://lkml.org/lkml/2007/11/11/97 ).  I don't know if this would
> > leave more paths which do a journal_start() while holding the
> > mmap_sem.
> >
> > 2) rework ext3's dio to only hold the jbd handle in
> > ext3_get_block().  Chris has a patch for this kicking around
> > somewhere but I'm told it has problems exposing old blocks in
> > ordered data mode.
> >
> > Does anyone have preferences?
> > I could go either way.  I certainly
> > don't like the idea of journal handles being held across the
> > entirety of fs/direct-io.c.  It's yet another case of O_DIRECT
> > differing wildly from the buffered path :(.
>
> I've looked more into it and I think that 2) is the only way to go
> since transaction start ranks below page lock (standard buffered
> write path) and page lock ranks below mmap_sem.  So we have at least
> one more dependency: mmap_sem must go before transaction start...

Just to clarify a little bit:

If ext3's DIO code only touches transactions in get_block, then it can
violate data=ordered rules.  Basically the transaction that allocates
the blocks might commit before the DIO code gets around to writing them.

A crash in the wrong place will expose stale data on disk.

-chris

From menscher at gmail.com  Thu Jan 24 08:24:19 2008
From: menscher at gmail.com (Damian Menscher)
Date: Thu, 24 Jan 2008 00:24:19 -0800
Subject: forced fsck (again?)
In-Reply-To: <4797F397.9020306@kadzban.is-a-geek.net>
References: <200801221701.50202.giancarlo.corti@supsi.ch>
	<70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
	<20080122225248.GD1659@mit.edu> <47969D69.4060607@kadzban.is-a-geek.net>
	<20080123031012.GD1320@mit.edu> <4796B60F.4040009@kadzban.is-a-geek.net>
	<20080123081548.GY3180@webber.adilger.int> <20080123140847.GB29321@mit.edu>
	<20080123192334.GG3180@webber.adilger.int> <4797F397.9020306@kadzban.is-a-geek.net>
Message-ID: <1d8411e00801240024yf31af33tb202e0bef44b5ec9@mail.gmail.com>

At the risk of adding complexity, what about having the SNAPSIZE be
automatically determined?  Most users would have no idea what to set it
to, and we should be able to guess some reasonable values.  For example,
the fsck time can probably be estimated by looking at the number of
inodes, how full the filesystem is, etc.  Alternatively, we could just
allocate all available space in the LVM.
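[Editorial note: the sizing policy already quoted earlier in the thread is just arithmetic, so an auto-determined SNAPSIZE can be sketched without touching LVM at all. The helper below is hypothetical (not from the thread); it assumes whole-megabyte inputs and reuses the 1/500 heuristic plus the MINSNAP/MINFREE clamps from the lvcheck script.]

```shell
# Hypothetical sketch: pick a snapshot size (MB) for an LV, given its
# size and the free space in its volume group.  Mirrors the 1/500
# heuristic and MINSNAP/MINFREE clamps of the lvcheck script above.
# Assumes whole-megabyte integers (strip decimals from lvs output first).
pick_snapsize() {
    lv_size="$1"          # size of the LV to check, in MB
    vg_free="$2"          # free space in its VG, in MB
    minsnap="${3:-256}"   # smallest useful snapshot
    minfree="${4:-0}"     # space the admin wants left untouched

    avail=$((vg_free - minfree))

    # not even MINSNAP available: caller should skip this LV
    [ "$avail" -le 0 ] && return 1
    [ "$minsnap" -gt "$avail" ] && return 1

    snap=$((lv_size / 500))
    [ "$snap" -lt "$minsnap" ] && snap="$minsnap"   # floor at MINSNAP
    [ "$snap" -gt "$avail" ] && snap="$avail"       # cap at what the VG can spare

    echo "$snap"
}
```

Something like `SNAPSIZE=$(pick_snapsize "$SIZE" "$SPACE" "$MINSNAP" "$MINFREE") || continue` would then replace the expr/clamp block in the main loop.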
I also have a newbie question: does the fsck of a snapshot really catch
everything that might be wrong with the drive, or are there other
failure modes that only a real fsck would catch?  I'm wondering if it's
still a good idea to do an occasional full fsck.

Damian

2008/1/23 Bryan Kadzban:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: RIPEMD160
>
> Andreas Dilger wrote:
> > On Jan 23, 2008  09:08 -0500, Theodore Tso wrote:
> >> (We could sneak some of that information into the options field of
> >> fstab, since the kernel and other programs that parse that field
> >> just take what they need and ignore the rest, but.... ick, ick,
> >> ick. :-)
> >
> > I agree - adding email to fstab is icky and I wouldn't go there.  I
> > don't see a problem with just emailing it to "root@" by default and
> > giving the user the option to change it to something else.
>
> Since the email address is not per-filesystem, it's fine by me to put it
> into a config file somewhere.  Forcing the interval to be global is
> probably also OK, although I wouldn't want to be forced to set the
> snapshot size globally.  I do think that fstab is the best place for
> per-filesystem options, though.
>
> But it's not too difficult to parse out a custom SNAPSIZE option, and
> even have a DEFAULT_SNAPSIZE in the config file if no SNAPSIZE option is
> present on any LV, if the script is going to parse fstab anyway.  (Or
> should the option's name be lowercase?  Either will work.)
>
> >> Also, I could imagine that a user might not want to check all of
> >> the filesystems in fstab.
> >
> > Similarly, a config file which disables checking on some LV if
> > specified seems reasonable.
>
> That does seem reasonable, but I haven't done it in the script that's
> attached.  Maybe support for a SKIP (or skip, or e2check_skip, or
> skip_e2check, or whatever) option in fstab's options field?
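[Editorial note: matching such a flag in the comma-separated options field is a one-line case pattern. The sketch below is hypothetical; the option name "e2check_skip" is only one of the candidates floated above, nothing has been agreed on.]

```shell
# Hypothetical sketch: succeed if the fstab options field ($1) contains
# a skip flag.  Wrapping both the field and the flag in commas avoids
# false matches on longer option names (e.g. "e2check_skipme").
has_skip_option() {
    case ",$1," in
        *,e2check_skip,*) return 0 ;;
        *)                return 1 ;;
    esac
}

# In the fstab-parsing loop this would become:
#     has_skip_option "$OPTIONS" && continue
```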
>
> Regarding the idea of having this support multiple filesystems -- that's
> a good idea, I think, but the current script is highly specific to ext2
> or ext3.  Use of tune2fs (to reset the last-check time) and dumpe2fs (to
> find the last-check time), in particular, will be problematic on other
> FSes.  I haven't done that in this script, though it may be possible.
>
> Anyway, here's a second version.  I've changed it to parse up fstab,
> and added an option for what to do if AC status can't be determined.
> Kernel-style changelog entry, etc., below:
>
> - -------
>
> Create a script to transparently run e2fsck in the background on any LVM
> logical volumes listed in /etc/fstab, as long as the machine is on AC
> power, and that LV has been last checked more than a configurable number
> of days ago.  Also create a configuration file to set various options in
> the script.
>
> Signed-Off-By: Bryan Kadzban
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.7 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFHl/OXS5vET1Wea5wRA/UaAJwIE27W6qasI7Gm/uvZm/pY1rcBtwCcDXYq
> cc3qE/uOEqm4ksYHlI6+IJU=
> =7Lf3
> -----END PGP SIGNATURE-----
>
> #!/bin/sh
> #
> # e2check
>
> # Released under the GNU General Public License, either version 2 or
> # (at your option) any later version.
>
> # Overview:
> #
> # Run this from cron each night.  If the machine is on AC power, it
> # will run the checks; otherwise they will all be skipped.  (If the
> # script can't tell whether the machine is on AC power, a setting in
> # the configuration file (/etc/e2check.conf) decides whether it will
> # continue or abort.)
> #
> # The script will then decide which filesystems in /etc/fstab are on
> # logical volumes, and can therefore be checked via an LVM snapshot.
> # Each of these filesystems will be queried to find its last check
> # day, and if that was more than $INTERVAL days ago (where INTERVAL
> # is set in the configuration file as well), then the script will
> # take an LVM snapshot of the filesystem and run e2fsck on the
> # snapshot.  The snapshot's size can be set via either the SNAPSIZE
> # option in the options field in /etc/fstab, or the DEFAULT_SNAPSIZE
> # option in /etc/e2check.conf -- but make sure it's set large enough.
> # After e2fsck finishes, the snapshot is destroyed.
> #
> # Any filesystem that passes e2fsck will have its last-check time
> # updated (in the real superblock, not the snapshot); any filesystem
> # that fails will send an email notification to a configurable user
> # ($EMAIL).  This $EMAIL setting is optional, but its use is highly
> # recommended, since if any filesystem fails, it will need to be
> # checked manually offline.
>
> function on_ac_power() {
>     local any_known=no
>
>     # try sysfs power class first
>     if [ -d /sys/class/power_supply ] ; then
>         for psu in /sys/class/power_supply/* ; do
>             if [ -r "${psu}/type" ] ; then
>                 type="`cat "${psu}/type"`"
>
>                 # ignore batteries
>                 [ "${type}" = "Battery" ] && continue
>
>                 online="`cat "${psu}/online"`"
>
>                 [ "${online}" = 1 ] && return 0
>                 [ "${online}" = 0 ] && any_known=yes
>             fi
>         done
>
>         [ "${any_known}" = "yes" ] && return 1
>     fi
>
>     # else fall back to AC adapters in /proc
>     if [ -d /proc/acpi/ac_adapter ] ; then
>         for ac in /proc/acpi/ac_adapter/* ; do
>             if [ -r "${ac}/state" ] ; then
>                 grep -q on-line "${ac}/state" && return 0
>                 grep -q off-line "${ac}/state" && any_known=yes
>             elif [ -r "${ac}/status" ] ; then
>                 grep -q on-line "${ac}/status" && return 0
>                 grep -q off-line "${ac}/status" && any_known=yes
>             fi
>         done
>
>         [ "${any_known}" = "yes" ] && return 1
>     fi
>
>     if [ "$AC_UNKNOWN" == "CONTINUE" ] ; then
>         return 0    # assume on AC power
>     elif [ "$AC_UNKNOWN" == "ABORT" ] ; then
>         return 1    # assume on battery
>     else
>         echo "Invalid value for AC_UNKNOWN in the config file" >&2
>         exit 1
>     fi
> }
>
> function check_fs() {
>     local vg="$1"
>     local lv="$2"
>     local opts="$3"
>     local snapsize="${DEFAULT_SNAPSIZE}"
>
>     case "$opts" in
>     *SNAPSIZE=*)
>         # parse out just the SNAPSIZE option's value
>         snapsize="${opts##*SNAPSIZE=}"
>         snapsize="${snapsize%%,*}"
>         ;;
>     esac  # else leave it at DEFAULT_SNAPSIZE
>
>     [ -z "$snapsize" ] && return 1
>
>     local tmpfile=`mktemp -t e2fsck.log.XXXXXXXXXX`
>     trap "rm $tmpfile ; trap - RETURN" RETURN
>
>     local start="$(date +'%Y%m%d%H%M%S')"
>
>     lvcreate -s -L "${snapsize}" -n "${lv}-snap" "${vg}/${lv}"
>
>     if nice logsave -as $tmpfile e2fsck -p -C 0 "/dev/${vg}/${lv}-snap" && \
>        nice logsave -as $tmpfile e2fsck -fy -C 0 "/dev/${vg}/${lv}-snap" ; then
>         echo 'Background scrubbing succeeded!'
>         tune2fs -C 0 -T "${start}" "/dev/${vg}/${lv}"
>     else
>         echo 'Background scrubbing failed!  Reboot to fsck soon!'
>         tune2fs -C 16000 -T "19000101" "/dev/${vg}/${lv}"
>
>         if test -n "$EMAIL"; then
>             mail -s "E2fsck of /dev/${vg}/${lv} failed!" $EMAIL < $tmpfile
>         fi
>     fi
>
>     lvremove -f "${vg}/${lv}-snap"
> }
>
> set -e
>
> # pull in configuration -- don't bother with a parser, just use the shell's
> . /etc/e2check.conf
>
> # check whether the machine is on AC power: if not, skip the e2fsck
> on_ac_power || exit 0
>
> # parse up fstab
> grep -v '^#' /etc/fstab | grep -v '^$' | awk '$6!=0 {print $1,$3,$4;}' | \
> while read FS FSTYPE OPTIONS ; do
>     # Use of tune2fs in check_fs, and dumpe2fs below, means we can
>     # only handle ext2/ext3 FSes
>     [ "$FSTYPE" != "ext3" -a "$FSTYPE" != "ext2" ] && continue
>
>     # get the volume group (or an error message)
>     VG="`lvs --noheadings -o vg_name "$FS" 2>&1`"
>
>     # skip non-LVM devices (hopefully LVM VGs don't have spaces)
>     [ "`echo "$VG" | awk '{print NF;}'`" -ne 1 ] && continue
>
>     # get the logical volume name
>     LV="`lvs --noheadings -o lv_name "$FS"`"
>
>     # get the last check time plus $INTERVAL days
>     check_date=`dumpe2fs -h "/dev/${VG}/${LV}" 2>/dev/null | grep 'Last checked:' | \
>         sed -e 's/Last checked:[[:space:]]*//'`
>     check_day=`date --date="${check_date} $INTERVAL days" +"%Y%m%d"`
>
>     # get today's date, and skip LVs that don't need to be checked yet
>     today=`date +"%Y%m%d"`
>     [ "$check_day" -gt "$today" ] && continue
>
>     # else, check it
>     check_fs "$VG" "$LV" "$OPTIONS"
> done
>
>
> #!/bin/sh
>
> # e2check configuration variables:
> #
> # EMAIL
> #    Address to send failure notifications to.  If empty,
> #    failure notifications will not be sent.
> #
> # INTERVAL
> #    Days to wait between checks.  All LVs use the same
> #    INTERVAL, but the "days since last check" value can
> #    be different per LV, since that value is stored in
> #    the ext2/ext3 superblock.
> #
> # DEFAULT_SNAPSIZE
> #    Default snapshot size to use if none is specified
> #    in the options field in /etc/fstab (using the custom
> #    SNAPSIZE=xxx option) for any LV.  Valid values are
> #    anything that the -L option to lvcreate will accept.
> #
> # AC_UNKNOWN
> #    Whether to run the e2fsck checks if the script can't
> #    determine whether the machine is on AC power.  Laptop
> #    users will want to set this to ABORT, while server and
> #    desktop users will probably want to set this to
> #    CONTINUE.  Those are the only two valid values.
>
> EMAIL='root'
> INTERVAL=30
> DEFAULT_SNAPSIZE=100m
> AC_UNKNOWN="ABORT"
>
>
> _______________________________________________
> Ext3-users mailing list
> Ext3-users at redhat.com
> https://www.redhat.com/mailman/listinfo/ext3-users
>

--
http://www.uiuc.edu/~menscher/

From chris.mason at oracle.com  Fri Jan 25 16:16:13 2008
From: chris.mason at oracle.com (Chris Mason)
Date: Fri, 25 Jan 2008 11:16:13 -0500
Subject: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
In-Reply-To: <20080125160931.GC1767@duck.suse.cz>
References: <200712242302.lBON2O8s011190@agora.fsl.cs.sunysb.edu>
	<20080114131454.37eb7c12@think.oraclecorp.com>
	<20080125160931.GC1767@duck.suse.cz>
Message-ID: <200801251116.13690.chris.mason@oracle.com>

On Friday 25 January 2008, Jan Kara wrote:
> > If ext3's DIO code only touches transactions in get_block, then it can
> > violate data=ordered rules.  Basically the transaction that allocates
> > the blocks might commit before the DIO code gets around to writing them.
> >
> > A crash in the wrong place will expose stale data on disk.
>
> Hmm, I've looked at it and I don't think so - look at the rationale in
> the patch below...  That patch should fix the lock-inversion problem (at
> least I see no lockdep warnings on my test machine).

Ah ok, when I was looking at this I was allowing holes to get filled
without falling back to buffered.  But, with the orphan inode entry
protecting things I see how you're safe with this patch.

-chris

From daviso at gmail.com  Thu Jan 31 22:38:51 2008
From: daviso at gmail.com (Davi Santos Oliveira)
Date: Thu, 31 Jan 2008 20:38:51 -0200
Subject: Ext3 Repair
Message-ID:

Hello,

First, sorry for my English.  I'm new to this list, and I'm having
trouble because of a missing disk in my RAID 5.
The server has an LVM volume on the RAID 5, and the partitions on LVM
are ext3.  I can't identify where the ext3 superblock is on this LVM
partition, so I can't point fsck at it.  I've tried many things:

fsck -b 8192 /dev/VolGroup/LogVol04
dumpe2fs /dev/VolGroup/LogVol04 | grep -i superblock

I also tried testdisk, and none of these solved my problem.  I need to
recover the files from the ext3 partition, or to repair the partition,
which sounds better to me.

Can anyone help me?

[]'s

--
Davi Santos Oliveira

From mb--ext3 at dcs.qmul.ac.uk  Thu Jan 31 16:27:48 2008
From: mb--ext3 at dcs.qmul.ac.uk (Matt Bernstein)
Date: Thu, 31 Jan 2008 16:27:48 +0000 (GMT)
Subject: forced fsck (again?)
In-Reply-To: <20080122225248.GD1659@mit.edu>
References: <200801221701.50202.giancarlo.corti@supsi.ch>
	<4796157E.5040803@redhat.com> <479615CA.1090408@redhat.com>
	<70b6f0bf0801221434r1f03b591w7525b7110dab27a8@mail.gmail.com>
	<20080122225248.GD1659@mit.edu>
Message-ID:

On Jan 22 Theodore Tso wrote:

> #!/bin/sh
> #
> # e2croncheck
>
> VG=closure
> VOLUME=root
> SNAPSIZE=100m
> EMAIL=tytso at mit.edu

[snip]

> Well, this isn't a complete solution, because a lot of people don't
> use LVM

Please forgive my late noticing of this.  The idea is good, and will
work fine in 99% of cases.

I'd love to snapshot (for rsync as well as fsck) my large filesystems,
which have external journals which in turn are in a different VG.  I
suspect that if I were to naively run your script, really interesting
things would be likely to happen ;)

So.. I'd love to atomically make two snapshots, but I guess that is
Hard (or would at least require a very coarse lock).  I suppose in the
meantime I could "tune2fs -O ^has_journal" the snapshot volume, but I'm
too scared even to do that.

So.. maybe I could request that you either include a Big Fat Disclaimer,
or code based on the following (untested, you can probably do better)?

if (tune2fs -l /dev/${VG}/${VOLUME} | egrep -q "Journal device")
then
	echo "Cowardly refusing to play with external journals."
	echo "There be dragons!"
	exit 1
fi
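[Editorial note: the guard above can be made testable by taking the superblock listing on stdin instead of running tune2fs inline. This variant is equally untested against a real external-journal setup; it matches the same "Journal device" line the fragment above greps for.]

```shell
# Sketch of the same guard with the device listing read from stdin, so
# it can be exercised without a real filesystem.  A filesystem with an
# external journal reports a "Journal device:" line in its superblock
# dump; one with an internal journal reports "Journal inode:" instead.
has_external_journal() {
    grep -q '^Journal device:'
}

# usage:
#   if tune2fs -l "/dev/${VG}/${VOLUME}" | has_external_journal ; then
#       echo "Cowardly refusing to play with external journals." >&2
#       exit 1
#   fi
```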