From tpo2 at sourcepole.ch  Mon May 25 21:59:44 2015
From: tpo2 at sourcepole.ch (Tomas Pospisek)
Date: Mon, 25 May 2015 23:59:44 +0200
Subject: fsck failing to notice that the block device was pulled out from under it?
Message-ID: <55639B50.5000002@sourcepole.ch>

Hello,

tl;dr: it seems like fsck fails to notice when the block device
disappears from under it.

I have the following setup:

* external USB disk
* a partition with LUKS in it
* ext4 filesystem inside the LUKS block device

While doing backups to it I noticed that after some time backups would
fail with an error (failed to write, ).

In the following log I'm attaching the disk, enabling LUKS and mounting
the disk and then starting the backup, which will fail after a while:

May 25 11:39:18 hier kernel: [76241.727367] usb 4-1.1: new high-speed USB device number 6 using ehci-pci
May 25 11:39:19 hier kernel: [76241.881892] usb 4-1.1: New USB device found, idVendor=0bc2, idProduct=3320
May 25 11:39:19 hier kernel: [76241.881902] usb 4-1.1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
May 25 11:39:19 hier kernel: [76241.881907] usb 4-1.1: Product: Expansion Desk
May 25 11:39:19 hier kernel: [76241.881911] usb 4-1.1: Manufacturer: Seagate
May 25 11:39:19 hier kernel: [76241.881915] usb 4-1.1: SerialNumber: NA4JDN4N
May 25 11:39:19 hier kernel: [76241.882652] usb-storage 4-1.1:1.0: USB Mass Storage device detected
May 25 11:39:19 hier kernel: [76241.883007] scsi7 : usb-storage 4-1.1:1.0
May 25 11:39:20 hier kernel: [76242.907527] usb 4-1.1: USB disconnect, device number 6
May 25 11:39:30 hier kernel: [76252.987833] usb 4-1.1: new high-speed USB device number 7 using ehci-pci
May 25 11:39:30 hier kernel: [76253.142659] usb 4-1.1: New USB device found, idVendor=0bc2, idProduct=3320
May 25 11:39:30 hier kernel: [76253.142669] usb 4-1.1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
May 25 11:39:30 hier kernel: [76253.142674] usb 4-1.1: Product: Expansion Desk
May 25 11:39:30 hier kernel: [76253.142678] usb 4-1.1: Manufacturer: Seagate
May 25 11:39:30 hier kernel: [76253.142681] usb 4-1.1: SerialNumber: NA4JDN4N
May 25 11:39:30 hier kernel: [76253.143465] usb-storage 4-1.1:1.0: USB Mass Storage device detected
May 25 11:39:30 hier kernel: [76253.144285] scsi8 : usb-storage 4-1.1:1.0
May 25 11:39:31 hier kernel: [76254.144816] scsi 8:0:0:0: Direct-Access Seagate Expansion Desk 070B PQ: 0 ANSI: 6
May 25 11:39:31 hier kernel: [76254.145528] sd 8:0:0:0: Attached scsi generic sg1 type 0
May 25 11:39:31 hier kernel: [76254.146193] sd 8:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
May 25 11:39:31 hier kernel: [76254.147465] sd 8:0:0:0: [sdb] Write Protect is off
May 25 11:39:31 hier kernel: [76254.147477] sd 8:0:0:0: [sdb] Mode Sense: 43 00 00 00
May 25 11:39:31 hier kernel: [76254.148657] sd 8:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
May 25 11:39:31 hier kernel: [76254.149907] sd 8:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
May 25 11:39:31 hier kernel: [76254.179482] sdb: sdb1 sdb2
May 25 11:39:31 hier kernel: [76254.181592] sd 8:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
May 25 11:39:31 hier kernel: [76254.184058] sd 8:0:0:0: [sdb] Attached SCSI disk
May 25 11:39:40 hier kernel: [76263.066639] device-mapper: uevent: version 1.0.3
May 25 11:39:40 hier kernel: [76263.066841] device-mapper: ioctl: 4.27.0-ioctl (2013-10-30) initialised: dm-devel at redhat.com
May 25 11:39:40 hier kernel: [76263.706456] NET: Registered protocol family 38
May 25 11:39:41 hier kernel: [76263.976836] sha256_ssse3: Using AVX optimized SHA-256 implementation
May 25 11:39:45 hier kernel: [76268.130879] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
May 25 12:39:32 hier kernel: [79854.051465] usb 4-1.1: USB disconnect, device number 7
May 25 12:39:32 hier kernel: [79854.052460] sd 8:0:0:0: [sdb] Synchronizing SCSI cache
May 25 12:39:32 hier kernel: [79854.052634] sd 8:0:0:0: [sdb]
May 25 12:39:32 hier kernel: [79854.052636] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 25 12:39:32 hier kernel: [79854.250864] usb 4-1.1: new high-speed USB device number 8 using ehci-pci
May 25 12:39:32 hier kernel: [79854.405885] usb 4-1.1: New USB device found, idVendor=0bc2, idProduct=3320
May 25 12:39:32 hier kernel: [79854.405895] usb 4-1.1: New USB device strings: Mfr=2, Product=3, SerialNumber=1
May 25 12:39:32 hier kernel: [79854.405900] usb 4-1.1: Product: Expansion Desk
May 25 12:39:32 hier kernel: [79854.405904] usb 4-1.1: Manufacturer: Seagate
May 25 12:39:32 hier kernel: [79854.405908] usb 4-1.1: SerialNumber: NA4JDN4N
May 25 12:39:32 hier kernel: [79854.406710] usb-storage 4-1.1:1.0: USB Mass Storage device detected
May 25 12:39:32 hier kernel: [79854.407068] scsi9 : usb-storage 4-1.1:1.0
May 25 12:39:32 hier kernel: [79854.655950] Buffer I/O error on device dm-0, logical block 464117312
May 25 12:39:32 hier kernel: [79854.655959] Buffer I/O error on device dm-0, logical block 464117312
May 25 12:39:33 hier kernel: [79855.407979] scsi 9:0:0:0: Direct-Access Seagate Expansion Desk 070B PQ: 0 ANSI: 6
May 25 12:39:33 hier kernel: [79855.408847] sd 9:0:0:0: Attached scsi generic sg1 type 0
May 25 12:39:33 hier kernel: [79855.410055] sd 9:0:0:0: [sdc] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
May 25 12:39:33 hier kernel: [79855.411191] sd 9:0:0:0: [sdc] Write Protect is off
May 25 12:39:33 hier kernel: [79855.411202] sd 9:0:0:0: [sdc] Mode Sense: 43 00 00 00
May 25 12:39:33 hier kernel: [79855.412402] sd 9:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
May 25 12:39:33 hier kernel: [79855.413245] sd 9:0:0:0: [sdc] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
May 25 12:39:37 hier kernel: [79858.940027] sdc: sdc1 sdc2
May 25 12:39:37 hier kernel: [79858.942029] sd 9:0:0:0: [sdc] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
May 25 12:39:37 hier kernel: [79858.944241] sd 9:0:0:0: [sdc] Attached SCSI disk
May 25 12:39:51 hier kernel: [79872.773254] Buffer I/O error on device dm-0, logical block 17
May 25 12:39:51 hier kernel: [79872.773259] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773295] Buffer I/O error on device dm-0, logical block 68681744
May 25 12:39:51 hier kernel: [79872.773297] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773327] Buffer I/O error on device dm-0, logical block 68681774
May 25 12:39:51 hier kernel: [79872.773328] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773359] Buffer I/O error on device dm-0, logical block 68681983
May 25 12:39:51 hier kernel: [79872.773360] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773390] Buffer I/O error on device dm-0, logical block 68690183
May 25 12:39:51 hier kernel: [79872.773391] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773421] Buffer I/O error on device dm-0, logical block 71303343
May 25 12:39:51 hier kernel: [79872.773422] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773452] Buffer I/O error on device dm-0, logical block 72876154
May 25 12:39:51 hier kernel: [79872.773453] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773483] Buffer I/O error on device dm-0, logical block 77070492
May 25 12:39:51 hier kernel: [79872.773485] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773514] Buffer I/O error on device dm-0, logical block 98566225
May 25 12:39:51 hier kernel: [79872.773515] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773558] Aborting journal on device dm-0-8.
May 25 12:39:51 hier kernel: [79872.773588] Buffer I/O error on device dm-0, logical block 231768064
May 25 12:39:51 hier kernel: [79872.773590] lost page write due to I/O error on dm-0
May 25 12:39:51 hier kernel: [79872.773600] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
May 25 12:39:51 hier kernel: [79872.773817] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -19 writing to inode 17173338 (offset 2490368 size 131072 starting block 148009568)
May 25 12:39:51 hier kernel: [79872.773821] Buffer I/O error on device dm-0, logical block 148009568
May 25 12:39:51 hier kernel: [79872.773827] Buffer I/O error on device dm-0, logical block 148009569
May 25 12:39:51 hier kernel: [79872.773829] Buffer I/O error on device dm-0, logical block 148009570
May 25 12:39:51 hier kernel: [79872.773830] Buffer I/O error on device dm-0, logical block 148009571
May 25 12:39:51 hier kernel: [79872.773832] Buffer I/O error on device dm-0, logical block 148009572
May 25 12:39:51 hier kernel: [79872.773834] Buffer I/O error on device dm-0, logical block 148009573
May 25 12:39:51 hier kernel: [79872.773836] Buffer I/O error on device dm-0, logical block 148009574
May 25 12:39:51 hier kernel: [79872.773838] Buffer I/O error on device dm-0, logical block 148009575
May 25 12:39:51 hier kernel: [79872.773840] Buffer I/O error on device dm-0, logical block 148009576
May 25 12:39:51 hier kernel: [79872.773841] Buffer I/O error on device dm-0, logical block 148009577
May 25 12:39:51 hier kernel: [79872.773898] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -19 writing to inode 17173338 (offset 2490368 size 131072 starting block 148009598)
May 25 12:39:51 hier kernel: [79872.773948] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -19 writing to inode 40770054 (offset 3145728 size 8192 starting block 147922176)
May 25 12:39:51 hier kernel: [79872.773975] EXT4-fs warning (device dm-0): ext4_end_bio:317: I/O error -19 writing to inode 40770076 (offset 3887104 size 4096 starting block 5281717)
May 25 12:39:51 hier kernel: [79873.230984] EXT4-fs error (device dm-0): ext4_put_super:790: Couldn't clean up the journal
May 25 12:39:51 hier kernel: [79873.230990] EXT4-fs (dm-0): Remounting filesystem read-only
May 25 12:39:51 hier kernel: [79873.230992] EXT4-fs (dm-0): previous I/O error to superblock detected
May 25 12:44:07 hier kernel: [80128.859704] EXT4-fs (dm-0): recovery complete
May 25 12:44:07 hier kernel: [80128.878030] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)

Which layer is saying "Buffer I/O error on device dm-0, logical block
0"? Is it the LUKS module?

After one more round of trying to do a backup and failing I decided to
fsck the partition (the partition was unlocked via LUKS then):

# fsck.ext4 -c -c -k /dev/mapper/luks-446c3a20-eb34-45f6-9ff9-d4a5b17fdedd

After running for a while without any error reports, that gave me a
never-ending series of prompts from fsck wanting me to confirm its
actions. Stuff like:

die Block-Bitmap (24117248) von Gruppe 736 ist ungültig. Zurücksetzen? ja
[English: the block bitmap (24117248) of group 736 is invalid. Reset? yes]

Since I mount that disk with `gnome-disk`, I noticed there that the
lock on the LUKS encrypted partition had in the meantime become
*closed* again, without me doing anything.

Nevertheless fsck was happily continuing with its disk check.

So I think there are a few parts broken in this chain of layers. The one
that I can put a finger on is that fsck should notice or should be
notified when the block device under it ceases to exist, as is the case
when the LUKS device becomes locked again.

I'm not sure why fsck doesn't notice. Doesn't it get the right
information from the LUKS block device?

The end result of this is that my backups are lost. The disk can still
be read, but LUKS is no longer able to decipher it.

I'm not sure how to go forward from here.

I know that I have already had a series of problems with external USB
drives:

* I had to have my USB port re-soldered (one year ago) because
  contact with attached devices had become unreliable - they would
  disappear and reappear in the kernel log. After having been
  re-soldered, the port was reliable.

* ca. 7 months ago I had to throw away an external USB drive because
  again it would disappear in `gnome-disk` after a few minutes of
  doing backups onto it (same setup as today)

The current drive and the current re-soldered port have been working
reliably since then though.

So I'm not sure which parts of the chain are broken now. However I think
that:

1. the information of "the device is gone" or "has problems" should be
   logged by every layer from the bottom up to the top

   However I am only able to see the USB layer do that:

   "usb 4-1.1: USB disconnect, device number 7"

   I can't see any log message from usb-storage that would let me
   know that the device has disappeared. Neither can I see LUKS telling
   me that it is relocking the device (which `gnome-disk` is telling me
   it did). Nor can I see ext4 telling me that the block device under it
   is gone.

2. the information of "the device is gone" should be passed from the
   first layer that has that problem up through all following layers.

   Why doesn't this apparently happen?

What's the way forward to fix this?
*t

From tytso at mit.edu  Tue May 26 01:14:01 2015
From: tytso at mit.edu (Theodore Ts'o)
Date: Mon, 25 May 2015 21:14:01 -0400
Subject: fsck failing to notice that the block device was pulled out from under it?
In-Reply-To: <55639B50.5000002@sourcepole.ch>
References: <55639B50.5000002@sourcepole.ch>
Message-ID: <20150526011401.GC16402@thunk.org>

On Mon, May 25, 2015 at 11:59:44PM +0200, Tomas Pospisek wrote:
> Hello,
>
> tl;dr: it seems like fsck fails to notice when the block device
> disappears from under it.

When a block device disappears, reads (and writes) using the file
descriptor open on the block device will return errors, and that is
how e2fsck "notices". And as far as fsck is concerned, the only "block
device" which it is interacting with is the device mapper node which
is exported by the LUKS encrypted device --- and the problem is that
the device mapper node is *not* disappearing.

> Nevertheless fsck was happily continuing with its disk check.
>
> So I think there are a few parts broken in this chain of layers. The one
> that I can put a finger on is that fsck should notice or should be
> notified when the block device under it ceases to exist, as is the case
> when the LUKS device becomes locked again.
>
> I'm not sure why fsck doesn't notice. Doesn't it get the right
> information from the LUKS block device?

Apparently not. I think you need to complain to the LUKS and
device-mapper developers. I will note that some device-mapper nodes
are *designed* to hide the fact that one or more of the underlying
block devices might have disappeared --- for example, in the case of
a dm_multipath or dm_raid device, you want the exported device-mapper
"block device" to survive even if one or more of the underlying
constituent block devices have disappeared. That's the whole point of
those device-mapper nodes.

> The end result of this is, that my backups are lost. The disk can still
> be read, but LUKS is no more able to decipher it.

So that seems weird. I don't know why LUKS would be corrupting the
device just because of a USB disconnect. As I said, the worst that
*should* happen is that reads and writes should be returning I/O
errors. But this is a LUKS / dm_crypt problem, so you should be
raising this question with the device mapper folks.

Good luck,

					- Ted

From tpo2 at sourcepole.ch  Tue May 26 06:39:24 2015
From: tpo2 at sourcepole.ch (Tomas Pospisek)
Date: Tue, 26 May 2015 08:39:24 +0200
Subject: fsck failing to notice that the block device was pulled out from under it?
In-Reply-To: <20150526011401.GC16402@thunk.org>
References: <55639B50.5000002@sourcepole.ch> <20150526011401.GC16402@thunk.org>
Message-ID: <5564151C.30707@sourcepole.ch>

On 26.05.2015 at 03:14, Theodore Ts'o wrote:
> On Mon, May 25, 2015 at 11:59:44PM +0200, Tomas Pospisek wrote:
>> Hello,
>>
>> tl;dr: it seems like fsck fails to notice when the block device
>> disappears from under it.
>
> When a block device disappears, reads (and writes) using the file
> descriptor open on the block device will return errors, and that is
> how e2fsck "notices". And as far as fsck is concerned, the only "block
> device" which it is interacting with is the device mapper node which
> is exported by the LUKS encrypted device --- and the problem is that
> the device mapper node is *not* disappearing.
>
>> Nevertheless fsck was happily continuing with its disk check.
>>
>> So I think there are a few parts broken in this chain of layers. The one
>> that I can put a finger on is that fsck should notice or should be
>> notified when the block device under it ceases to exist, as is the case
>> when the LUKS device becomes locked again.
>>
>> I'm not sure why fsck doesn't notice. Doesn't it get the right
>> information from the LUKS block device?
>
> Apparently not. I think you need to complain to the LUKS and
> device-mapper developers. I will note that some device-mapper nodes
> are *designed* to hide the fact that one or more of the underlying
> block devices might have disappeared --- for example, in the case of
> a dm_multipath or dm_raid device, you want the exported device-mapper
> "block device" to survive even if one or more of the underlying
> constituent block devices have disappeared. That's the whole point of
> those device-mapper nodes.
>
>> The end result of this is, that my backups are lost. The disk can still
>> be read, but LUKS is no more able to decipher it.
>
> So that seems weird. I don't know why LUKS would be corrupting the
> device just because of a USB disconnect. As I said, the worst that
> *should* happen is that reads and writes should be returning I/O
> errors. But this is a LUKS / dm_crypt problem, so you should be
> raising this question with the device mapper folks.

Thanks a lot for your explanation Ted!

One more question if I may. You have in principle already answered that
question, however I want to be sure about it. Who is it that is writing
this to the kernel log:

May 25 12:39:51 hier kernel: [79872.773327] Buffer I/O error on device dm-0, logical block 68681774
May 25 12:39:51 hier kernel: [79872.773328] lost page write due to I/O error on dm-0

Is it the layers *below* the ext4 module that are reporting this?

Again, thanks a lot for your explanation!
*t

From tytso at mit.edu  Tue May 26 11:43:44 2015
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 26 May 2015 07:43:44 -0400
Subject: fsck failing to notice that the block device was pulled out from under it?
In-Reply-To: <5564151C.30707@sourcepole.ch>
References: <55639B50.5000002@sourcepole.ch> <20150526011401.GC16402@thunk.org> <5564151C.30707@sourcepole.ch>
Message-ID: <20150526114344.GG16402@thunk.org>

On Tue, May 26, 2015 at 08:39:24AM +0200, Tomas Pospisek wrote:
> One more question if I may. You have in principle already answered that
> question, however I want to be sure about it. Who is it that is writing
> this to the kernel log:
>
> May 25 12:39:51 hier kernel: [79872.773327] Buffer I/O error on
> device dm-0, logical block 68681774
> May 25 12:39:51 hier kernel: [79872.773328] lost page write due to
> I/O error on dm-0
>
> is it the layers *below* the ext4 module that are reporting this?

These messages are coming from fs/buffer.c. What component was
calling the buffer cache is not evident from your log excerpt. Note
that "the ext4 module" is different from fsck, which was what you were
asking about earlier. If the file system was *mounted*, then it was
probably from the file system layer (whether it was mounted using
ext3, ext4, vfat, etc.)
Assuming modern kernels, reads and writes to the block device (such as
from user space programs such as e2fsck) don't end up going through
the buffer cache.

The bottom line is that if your USB interface is flaky (whether it is
caused by a problem in your connector, the USB cable, the USB
controller in the hard drive, the host USB etc.) there's not a whole
lot that upper layers can do.

What should happen though is that when a USB device disconnects and
reconnects, it shows up as a new block device. So it should not be
automatically reconnected to the LUKS device unless something like
gnome-disk is "helpfully" doing this. And even if it is doing that,
if the dm-crypt device can't have its key established it *should* have
simply refused reads and writes, and not done something silly like
passing the reads and writes through even though it couldn't do the
encryption/decryption.

Finally, if e2fsck gets an I/O error reading or writing from the block
device, it will report it to the user and ask whether or not it should
continue.

My suggestion is to debug this by breaking it down. Try using an
unencrypted file system on a USB stick, and try what happens when you
yank it out while e2fsck is running. The USB stick should start
reporting errors, and then e2fsck will report it and ask whether you
want to continue or not. First try it without any GNOME crap running,
then try it with GNOME running.

Then try what happens when you run something straightforward such as
"dd if=/dev/dm-0 of=/dev/null", and then try yanking and removing the
USB stick, first unencrypted, and then with LUKS running, and then
with LUKS running and with GNOME trying to "help".

Each of the layers in the storage stack is independent, so you should
be able to isolate each layer and test it in isolation.

Regards,

					- Ted
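
The "dd if=/dev/dm-0 of=/dev/null" isolation test suggested above can
also be sketched in a few lines of Python, which makes it easy to see
exactly which errno a given layer hands back to user space once the
device is pulled. This is only a minimal sketch, assuming Python 3 on
Linux; the name `scan_device` and its `chunk_size` and `read`
parameters are made up for this illustration and are not part of
e2fsprogs, cryptsetup or any other tool mentioned in this thread.

```python
import errno
import os


def scan_device(path, chunk_size=1 << 20, read=os.read):
    """Read `path` sequentially, much like `dd if=path of=/dev/null`.

    Returns (bytes_read, None) when the end of the device is reached,
    or (bytes_read, errno_name) at the first read that fails.
    `read` is a hypothetical hook so that a failing device can be
    simulated without real hardware.
    """
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        while True:
            try:
                chunk = read(fd, chunk_size)
            except OSError as exc:
                # On Linux, EIO is 5 and ENODEV is 19, matching the
                # "-5" and "-19" in the kernel log earlier in the thread.
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                return total, name
            if not chunk:
                # A zero-length read means end of device, not an error.
                return total, None
            total += len(chunk)
    finally:
        os.close(fd)
```

Run as root with something like `print(scan_device("/dev/dm-0"))`
while unplugging the disk. If the mapped device keeps returning data
or reports a clean end of device after the USB disk is gone, the layer
below is hiding the disconnect; an `EIO` or `ENODEV` result means the
error is being passed up as it should be.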