From jelledejong at powercraft.nl Wed Oct 8 16:28:13 2014
From: jelledejong at powercraft.nl (Jelle de Jong)
Date: Wed, 08 Oct 2014 18:28:13 +0200
Subject: CF Card wear optimalisation for ext4
Message-ID: <5435661D.2040905@powercraft.nl>

Hello everyone,

I have been using CF cards for more than 7 years now with ext
file-systems without any major problems on ALIX boards.

Last year I took 30 other systems into production with ext4 and the CF
cards have been dropping out pretty fast. It may have been a bad batch,
but I do want to look into it. I don't think the devices write a lot of
IO (is there a tool that can give me some useful numbers for, say, 24H
or a week? iotop, atop and sysstat don't seem suited for long-term IO
write monitoring, but maybe I am misusing them and could use some help
here).

I mount root with the following options:

/dev/disk/by-uuid/09a04c01-64c6-4600-9e22-525667bda3e3 on / type ext4
(rw,noatime,user_xattr,barrier=1,data=ordered)

# dumpe2fs /dev/sda1
http://paste.debian.net/hidden/e3f81f11/

Are there kernel options to avoid synchronous disk writes? As suggested
here: http://www.pcengines.ch/cfwear.htm

Is there a list of other kernel options I can optimise to limit any CF
wear? The devices don't use

Kind regards

Jelle de Jong
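A minimal sketch of the kind of long-term write monitoring being asked
about here, using the block layer's own counters rather than iotop or
atop, could look like the following. It assumes the CF card shows up as
sda (adjust the device name); field 10 of /proc/diskstats is the
cumulative number of 512-byte sectors written since boot, independent of
the device's real sector size:

    #!/bin/sh
    # Log the cumulative kilobytes written to the device since boot,
    # once per hour.
    DEV=sda                      # assumed device name, adjust as needed
    LOG=/var/log/cf-writes.log

    while :; do
        sectors=$(awk -v d="$DEV" '$3 == d { print $10 }' /proc/diskstats)
        echo "$(date '+%Y-%m-%d %H:%M:%S') $((sectors / 2)) kB written since boot" >> "$LOG"
        sleep 3600
    done

Subtracting the first and last entries of a day (or a week) gives the
write volume for that period; the same counters exist per partition
(sda1, ...) in the same file.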
From adilger at dilger.ca Fri Oct 10 19:02:09 2014
From: adilger at dilger.ca (Andreas Dilger)
Date: Fri, 10 Oct 2014 13:02:09 -0600
Subject: CF Card wear optimalisation for ext4
In-Reply-To: <5435661D.2040905@powercraft.nl>
References: <5435661D.2040905@powercraft.nl>
Message-ID: <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca>

On Oct 8, 2014, at 10:28 AM, Jelle de Jong wrote:
> Hello everyone,
>
> I have been using CF cards for more than 7 years now with ext
> file-systems without any major problems on ALIX boards.
>
> Last year I took 30 other systems into production with ext4 and the CF
> cards have been dropping out pretty fast. It may have been a bad batch,
> but I do want to look into it. I don't think the devices write a lot of
> IO (is there a tool that can give me some useful numbers for, say, 24H
> or a week? iotop, atop and sysstat don't seem suited for long-term IO
> write monitoring, but maybe I am misusing them and could use some help
> here).

You can see in the ext4 superblock the amount of data that has been
written to a filesystem over its lifetime:

dumpe2fs -h /dev/vg_mookie/lv_home
dumpe2fs 1.42.7.wc2 (07-Nov-2013)
Filesystem volume name: home
Last mounted on: /home
:
:
Lifetime writes: 27 GB
:
:

Note that this number isn't wholly accurate, but rather a guideline.
IIRC it is not updated on disk all the time, so it may lose writes.

You can also get this information from /sys/fs/ext4, including data just
for the current mount:

# grep . /sys/fs/ext4/*/*_write_kbytes
/sys/fs/ext4/dm-0/lifetime_write_kbytes:77632360
/sys/fs/ext4/dm-0/session_write_kbytes:7124948
/sys/fs/ext4/dm-19/lifetime_write_kbytes:28081448
/sys/fs/ext4/dm-19/session_write_kbytes:16520
/sys/fs/ext4/dm-2/lifetime_write_kbytes:60847858
/sys/fs/ext4/dm-2/session_write_kbytes:7739388
/sys/fs/ext4/dm-7/lifetime_write_kbytes:22385952
/sys/fs/ext4/dm-7/session_write_kbytes:6379728
/sys/fs/ext4/sda1/lifetime_write_kbytes:835020
/sys/fs/ext4/sda1/session_write_kbytes:60848

> I mount root with the following options:
>
> /dev/disk/by-uuid/09a04c01-64c6-4600-9e22-525667bda3e3 on / type ext4
> (rw,noatime,user_xattr,barrier=1,data=ordered)
>
> # dumpe2fs /dev/sda1
> http://paste.debian.net/hidden/e3f81f11/
>
> Are there kernel options to avoid synchronous disk writes? As suggested
> here: http://www.pcengines.ch/cfwear.htm

If you increase the journal commit interval (e.g. 30s) you can reduce
the number of times a block needs to be written to the journal. The
drawback is that you also increase the amount of un-sync'd metadata that
would be lost in case of a crash. This usually means the data would also
be lost, unless you are using a database-like workload that overwrites
the same files continuously.

> Is there a list of other kernel options I can optimise to limit any CF
> wear? The devices don't use
>
> Kind regards
>
> Jelle de Jong

Cheers, Andreas
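To turn these counters into the per-day numbers the original poster
asked about, one option is to snapshot them from cron once a day and
diff consecutive entries; the journal commit interval mentioned above is
set per mount with the commit= option. A sketch, assuming the root
filesystem sits on sda1 and using 30 seconds purely as an example value:

    #!/bin/sh
    # e.g. /etc/cron.daily/ext4-writes: append a dated snapshot of the
    # ext4 write counters to a log file
    {
        date '+%Y-%m-%d %H:%M:%S'
        grep . /sys/fs/ext4/sda1/*_write_kbytes
    } >> /var/log/ext4-writes.log

The matching /etc/fstab line for a longer commit interval (the default
is 5 seconds) would be along these lines, reusing the options already
shown above:

    UUID=09a04c01-64c6-4600-9e22-525667bda3e3 / ext4 noatime,user_xattr,barrier=1,commit=30,data=ordered 0 1

As noted, anything written within that window is lost on a crash or
power cut.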
From tytso at mit.edu Sat Oct 11 23:19:48 2014
From: tytso at mit.edu (Theodore Ts'o)
Date: Sat, 11 Oct 2014 19:19:48 -0400
Subject: CF Card wear optimalisation for ext4
In-Reply-To: <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca>
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca>
Message-ID: <20141011231948.GC6262@thunk.org>

Something else that you might want to do is count the number of journal
commits that are taking place, via a command like this:

perf stat -e jbd2:jbd2_start_commit -a sleep 3600

This will count the number of jbd2 commits that are executed in 3600
seconds --- i.e., an hour.

If you are running some workload which is constantly calling fsync(2),
that will be forcing journal commits, and those turn into cache flush
commands that force all state to stable storage. Now, if you are using
CF cards that aren't guaranteed to have power-loss protection (hint:
even most consumer grade SSD's do not have power loss protection --- you
have to pay $$$ for enterprise-grade SLC SSD's to have power loss
protection --- and I'm guessing most CF cards are so cheap that they
won't make guarantees that all of their flash metadata are saved to
stable store on a power loss event), the fact that you are constantly
using fsync(2) may not be providing you with the protection you want
after a power loss event.

That might not be a problem if you have a handset with a non-removable
eMMC device and a non-removable battery that can't fly out when you drop
the phone, but for devices which can easily suffer an unplanned power
failure, it may very well be the case that you're going to be badly
burned across a power fail event anyway. So the next question I would
ask you is whether you care about unplanned power failures.

If so, you probably want to test your CF cards to make sure they
actually will do the right thing across a power failure --- and if they
don't, you may need to replace your CF card provider. If you don't care
(because you don't have a removable battery, and the CF card is
permanently sealed inside your device, for example), then you might want
to consider disabling barriers so you're no longer forcing synchronous
cache flush commands to be sent to your CF card. This trades off power
failure safety against increased performance and decreased card wear ---
but if you don't need power failure safety, then it might be a good
tradeoff.

And if you *do* need power fail protection, then it's a good thing to
test whether your hardware will actually provide it, so you don't find
out the hard way that you're paying the cost of decreased performance
and increased card wear, but didn't get power fail protection *anyway*
because of hardware limitations.

Cheers,

- Ted
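If the outcome of that testing is that the cards offer no real
protection anyway, barriers are switched off per filesystem rather than
through a kernel option. A sketch, using the root filesystem from the
first mail; the same barrier=0 (or equivalently nobarrier) option can
then be added to the options column of /etc/fstab to make it permanent:

    # try it out on the running system first
    mount -o remount,barrier=0 /

    # confirm the active options for the root filesystem
    grep ' / ' /proc/mounts

This is exactly the trade-off described above: fewer cache flush
commands and less wear, but no durability for recently written data
across a power cut.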
From ibaldo at adinet.com.uy Sun Oct 12 14:07:47 2014
From: ibaldo at adinet.com.uy (Ivan Baldo)
Date: Sun, 12 Oct 2014 12:07:47 -0200
Subject: power loss protection
In-Reply-To: <20141011231948.GC6262@thunk.org>
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca> <20141011231948.GC6262@thunk.org>
Message-ID: <543A8B33.7020401@adinet.com.uy>

Hello.

On 11/10/14 21:19, Theodore Ts'o wrote:
> If you are running some workload which is constantly calling fsync(2),
> that will be forcing journal commits, and those turn into cache flush
> commands that force all state to stable storage. Now, if you are using
> CF cards that aren't guaranteed to have power-loss protection (hint:
> even most consumer grade SSD's do not have power loss protection --- you
> have to pay $$$ for enterprise-grade SLC SSD's to have power loss
> protection --- and I'm guessing most CF cards are so cheap that they
> won't make guarantees that all of their flash metadata are saved to
> stable store on a power loss event), the fact that you are constantly
> using fsync(2) may not be providing you with the protection you want
> after a power loss event.

This got me worried!

How can we test if a device really stores all the data safely after a
barrier and sudden power loss? Is there a tool for that?

I am thinking of something along the lines of a tool that does writes
with some barriers in between; then I unplug the device and run the same
tool in a "check mode" that tells me whether the data requested before
the barrier is really there. Something sysadmin friendly, or maybe even
user friendly, but not too hard to use.

Thanks for your insight!

-- 
Ivan Baldo - ibaldo at adinet.com.uy - http://ibaldo.codigolibre.net/
From Montevideo, Uruguay, at the south of South America.
Freelance programmer and GNU/Linux system administrator, hire me!
Alternatives: ibaldo at codigolibre.net - http://go.to/ibaldo

From squadra at gmail.com Sun Oct 12 17:53:52 2014
From: squadra at gmail.com (squadra)
Date: Sun, 12 Oct 2014 19:53:52 +0200
Subject: power loss protection
In-Reply-To: <543A8B33.7020401@adinet.com.uy>
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca> <20141011231948.GC6262@thunk.org> <543A8B33.7020401@adinet.com.uy>
Message-ID:

Dunno about any special tools, but misusing a MySQL database could be a
good check for this. Unplug/reset your device while inserts into the db
are ongoing (don't forget to use InnoDB for the tables), boot it up
again and take a look at the MySQL log. There's a good chance that
InnoDB gets wrecked... Sure, this is not perfect, but it could be an
impressive test if it ends the way I think it will.

Make sure your MySQL instance is configured to be "safe":

http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_flush_method
http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit

and enable binlogs + sync binlogs, or in other words: make it as slow as
possible :p

On Sun, Oct 12, 2014 at 4:07 PM, Ivan Baldo wrote:
> This got me worried!
>
> How can we test if a device really stores all the data safely after a
> barrier and sudden power loss? Is there a tool for that?
>
> I am thinking of something along the lines of a tool that does writes
> with some barriers in between; then I unplug the device and run the same
> tool in a "check mode" that tells me whether the data requested before
> the barrier is really there. Something sysadmin friendly, or maybe even
> user friendly, but not too hard to use.
>
> Thanks for your insight!

-- 
Sent from the Delta quadrant using Borg technology!
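For something closer to the dedicated tool being asked for, a rough
sketch follows. It only exercises fsync'd appends (not the card's
internal metadata), it assumes barriers are left at their default so
that fsync translates into a cache flush on the device, and the mount
point and file name are invented for the example; dd's conv=fsync is
what forces each record out:

    #!/bin/sh
    # Append numbered records, forcing each one to stable storage
    # before acknowledging it on stdout. Note the last acknowledged
    # number, then cut the power.
    F=/mnt/cf/powertest.log      # assumed mount point of the card under test
    i=0
    while :; do
        i=$((i + 1))
        printf 'record %08d\n' "$i" |
            dd of="$F" oflag=append conv=notrunc,fsync status=none
        echo "acknowledged $i"
    done

After power-cycling, "tail -n 1 /mnt/cf/powertest.log" is the check
mode: every record acknowledged before the power cut must still be
present and intact, otherwise the device (or the storage stack above it)
is dropping acknowledged writes.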
From bothie at gmx.de Thu Oct 16 16:25:55 2014
From: bothie at gmx.de (Bodo Thiesen)
Date: Thu, 16 Oct 2014 18:25:55 +0200
Subject: CF Card wear optimalisation for ext4
In-Reply-To: <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca>
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca>
Message-ID: <20141016182555.1f0798df@phenom>

* Andreas Dilger wrote:
> You can see in the ext4 superblock the amount of data that has been
> written to a filesystem over its lifetime:
>
> Note that this number isn't wholly accurate, but rather a guideline.

It is more like a completely bogus value at best:

# LANG=C df -h / | grep root
/dev/root  3.7T  3.6T  73G  99% /
# grep [0-9] /proc/partitions
   8  0  3907018584 sda
# tune2fs -l /dev/sda | grep Lifetime
Lifetime writes: 2503 GB

3.7 TB disk/partition, 3.6 TB of space in use, but only 2.4 TB of
writes. No, there are not 1.2 TB (plus x) of allocated but never written
clusters on that file system.

And if /sys/fs/ext4/*/*_write_kbytes is as correct as the "Lifetime
writes" value, then the correct answer to Jelle's question is: "There is
currently no way to figure out the actual number of writes to a device."

Regards, Bodo

From adilger at dilger.ca Thu Oct 16 19:33:11 2014
From: adilger at dilger.ca (Andreas Dilger)
Date: Thu, 16 Oct 2014 13:33:11 -0600
Subject: CF Card wear optimalisation for ext4
In-Reply-To: <20141016182555.1f0798df@phenom>
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca> <20141016182555.1f0798df@phenom>
Message-ID:

On Oct 16, 2014, at 10:25 AM, Bodo Thiesen wrote:
> * Andreas Dilger wrote:
>
>> You can see in the ext4 superblock the amount of data that has been
>> written to a filesystem over its lifetime:
>>
>> Note that this number isn't wholly accurate, but rather a guideline.
>
> It is more like a completely bogus value at best:
>
> # LANG=C df -h / | grep root
> /dev/root  3.7T  3.6T  73G  99% /
> # grep [0-9] /proc/partitions
>    8  0  3907018584 sda
> # tune2fs -l /dev/sda | grep Lifetime
> Lifetime writes: 2503 GB
>
> 3.7 TB disk/partition, 3.6 TB of space in use, but only 2.4 TB of
> writes. No, there are not 1.2 TB (plus x) of allocated but never written
> clusters on that file system.
>
> And if /sys/fs/ext4/*/*_write_kbytes is as correct as the "Lifetime
> writes" value, then the correct answer to Jelle's question is: "There is
> currently no way to figure out the actual number of writes to a device."

The "lifetime writes" value has not been around forever, so if the
filesystem was originally created and populated on an older kernel
(e.g. using ext3) it would not contain a record of those writes.

There is also some potential loss if the filesystem isn't unmounted
cleanly.

It definitely _can_ be used to monitor the writes to a particular
filesystem over the past 24h, which is what the original poster was
asking about.

Cheers, Andreas
From bothie at gmx.de Thu Oct 16 21:01:35 2014
From: bothie at gmx.de (Bodo Thiesen)
Date: Thu, 16 Oct 2014 23:01:35 +0200
Subject: CF Card wear optimalisation for ext4
In-Reply-To:
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca> <20141016182555.1f0798df@phenom>
Message-ID: <20141016230135.7cbcdbf0@phenom>

* Andreas Dilger wrote:
> The "lifetime writes" value has not been around forever, so if the
> filesystem was originally created and populated on an older kernel
> (e.g. using ext3) it would not contain a record of those writes.

It was created as stable ext4 in the first place. So this could only be
the case if there was a stable ext4 release which didn't update the
"lifetime writes" value.

> There is also some potential loss if the filesystem isn't unmounted
> cleanly.

Yeah, that *might* be it - but that only supports my statement that this
value is mainly bogus.

> It definitely _can_ be used to monitor the writes to a particular
> filesystem over the past 24h, which is what the original poster was
> asking about.

Since it never gets updated unless the file system is unmounted, it can
only be used for a 24-hour test by mounting the file system now,
unmounting it 24 hours from now and then taking the difference. Also,
the value is only available at a granularity of 1 GB (plus or minus
512 MB) - at least in my case.

So, in any case, I wouldn't trust that value for any purpose at all. I
did test /sys/fs/ext4/sda/lifetime_write_kbytes now; that seems to be
somewhat less bogus, so *that* might actually be usable for the 24-hour
test. But I wasn't talking about that when I said that this lifetime
thing is bogus.

Regards, Bodo

From tytso at mit.edu Fri Oct 17 15:43:06 2014
From: tytso at mit.edu (Theodore Ts'o)
Date: Fri, 17 Oct 2014 11:43:06 -0400
Subject: CF Card wear optimalisation for ext4
In-Reply-To: <20141016230135.7cbcdbf0@phenom>
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca> <20141016182555.1f0798df@phenom> <20141016230135.7cbcdbf0@phenom>
Message-ID: <20141017154306.GC15062@thunk.org>

On Thu, Oct 16, 2014 at 11:01:35PM +0200, Bodo Thiesen wrote:
>
> Since it never gets updated unless the file system is unmounted, it can
> only be used for a 24-hour test by mounting the file system now,
> unmounting it 24 hours from now and then taking the difference.

It also gets updated when the syncfs(2) or sync(2) system call is run on
the file system. But if you crash, any writes since the last syncfs(2),
sync(2), or umount(2) call on the file system can get lost, yes.

> Also, the value is only available at a granularity of 1 GB (plus or
> minus 512 MB) - at least in my case.

This is what dumpe2fs is currently using:

	if (sb->s_kbytes_written) {
		fprintf(f, "Lifetime writes: ");
		if (sb->s_kbytes_written < POW2(13))
			fprintf(f, "%llu kB\n", sb->s_kbytes_written);
		else if (sb->s_kbytes_written < POW2(23))
			fprintf(f, "%llu MB\n",
				(sb->s_kbytes_written + POW2(9)) >> 10);
		else if (sb->s_kbytes_written < POW2(33))
			fprintf(f, "%llu GB\n",
				(sb->s_kbytes_written + POW2(19)) >> 20);
		else if (sb->s_kbytes_written < POW2(43))
			fprintf(f, "%llu TB\n",
				(sb->s_kbytes_written + POW2(29)) >> 30);
		else
			fprintf(f, "%llu PB\n",
				(sb->s_kbytes_written + POW2(39)) >> 40);
	}

What we are doing was deliberate, in an effort to display things in a
user-friendly fashion that was hopefully still useful. If you'd like to
propose something different, please send patches and I'll consider it.

> I did test /sys/fs/ext4/sda/lifetime_write_kbytes now; that seems to be
> somewhat less bogus, so *that* might actually be usable for the 24-hour
> test. But I wasn't talking about that when I said that this lifetime
> thing is bogus.

Bogus is in the eye of the beholder. It's not perfect, and if your
system is regularly crashing, then it will be much less perfect. If it's
not helpful enough for your use case, don't use it. Certainly if the SSD
has information available via S.M.A.R.T., it's better to use that
instead. But if a crappy CF device doesn't have S.M.A.R.T., then I'm
afraid this is the best that we can offer....

- Ted
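For the S.M.A.R.T. route mentioned here, a quick way to see whether a
given card or SSD exposes anything usable is smartctl from
smartmontools; attribute names vary by vendor and many CF cards report
nothing at all, so treat this only as a sketch (device name assumed):

    # does the device answer SMART queries at all?
    smartctl -i /dev/sda

    # if so, look for write/wear related attributes such as
    # Total_LBAs_Written or a wear-leveling/life-remaining counter
    smartctl -A /dev/sda | grep -i -E 'wear|lba|life|erase'

Where such an attribute exists, it counts writes at the device level,
independent of any filesystem bookkeeping, which sidesteps the accuracy
questions discussed above.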
From bothie at gmx.de Fri Oct 17 23:20:37 2014
From: bothie at gmx.de (Bodo Thiesen)
Date: Sat, 18 Oct 2014 01:20:37 +0200
Subject: CF Card wear optimalisation for ext4
In-Reply-To: <20141017154306.GC15062@thunk.org>
References: <5435661D.2040905@powercraft.nl> <08F28BC7-34FD-46BA-9D91-CC7D57A4A4D5@dilger.ca> <20141016182555.1f0798df@phenom> <20141016230135.7cbcdbf0@phenom> <20141017154306.GC15062@thunk.org>
Message-ID: <20141018012037.1e5c5c63@phenom>

* "Theodore Ts'o" wrote:
> On Thu, Oct 16, 2014 at 11:01:35PM +0200, Bodo Thiesen wrote:
> >
> > Since it never gets updated unless the file system is unmounted, it can
> > only be used for a 24-hour test by mounting the file system now,
> > unmounting it 24 hours from now and then taking the difference.
>
> It also gets updated when the syncfs(2) or sync(2) system call is run on
> the file system.

Then sync(1) doesn't call sync(2) ... wait ...

# strace sync
[...]
sync()                                  = 0
[...]

Hmmm ... so why didn't the value get updated after writing some GB of
data (dd for testing yesterday)?

# sync
# echo 3 > /proc/sys/vm/drop_caches
# tune2fs -l /dev/sda | grep Lifetime
Lifetime writes: 2503 GB
# uptime
 01:11:10 up 2 days, 22:26, 15 users, load average: 0.98, 0.99, 1.02

So, I guess you have to recheck the code behind that statement, because
it really doesn't reflect reality.

> But if you crash, any writes since the last syncfs(2), sync(2),
> or umount(2) call on the file system can get lost, yes.

"Will", not "can".

>> Also, the value is only available at a granularity of 1 GB (plus or
>> minus 512 MB) - at least in my case.

> This is what dumpe2fs is currently using:
> [printing prefixes]
> What we are doing was deliberate, in an effort to display things in a
> user-friendly fashion that was hopefully still useful. If you'd like to
> propose something different, please send patches and I'll consider it.

I guess a simple command line option "--raw-values" would be cool, not
only for dumpe2fs but for tune2fs as well - one that just switches off
this (in fact not *that* bad) default behaviour on explicit demand. So,
if you're going to include such a patch for some[tm] tools, I'd be happy
to do it, just give me some time for that ;)

>> I did test /sys/fs/ext4/sda/lifetime_write_kbytes now; that seems to be
>> somewhat less bogus, so *that* might actually be usable for the 24-hour
>> test. But I wasn't talking about that when I said that this lifetime
>> thing is bogus.

> Bogus is in the eye of the beholder. It's not perfect, and if your
> system is regularly crashing, then it will be much less perfect.

It was because of the RCU bug, which I didn't know about, and I actually
blamed fglrx for the crashes - which was, as it turned out, not fair. My
system is stable now. (The uptime of almost 3 days is due to a deliberate
shutdown ;)

> If it's not helpful enough for your use case, don't use it.

Oh, actually, I don't need that value at all, Jelle needs it. But yes,
it seems good enough for his use case. But he has to use the /sys
interface, not tune2fs/dumpe2fs, for the lifetime writes value.

Regards, Bodo