[dm-devel] Barriers still not passing on simple dm devices...

Tue Mar 31 03:39:59 UTC 2009

On Thu, 26 Mar 2009, Jens Axboe wrote:

> On Wed, Mar 25 2009, Mikulas Patocka wrote:
>
> > > > So I think there should be a flag (this device does/doesn't support data 
> > > > consistency) that the journaled filesystems can use to mark the disk dirty 
> > > > for fsck. And if you implement this flag, you can accept barriers always 
> > > > to all kinds of devices regardless of whether they support consistency. You 
> > > > can then get rid of that -EOPNOTSUPP and simplify filesystem code because 
> > > > they'd no longer need two commit paths and a clumsy way to restart 
> > > > -EOPNOTSUPPed requests.
> > > 
> > > And my point is that this case isn't interesting, because most setups
> > > don't guarantee proper ordering.
> > 
> > If the ordering isn't guaranteed, the filesystem should know about it, and 
> > mark the partition for fsck. That's why I'm suggesting to use a flag for 
> > that. That flag could be also propagated up through md and dm.
> 
> We can do that, not a problem. The problem is that ordering is almost
> never preserved, SCSI does not use ordered tags because it hasn't
> verified that its error path doesn't reorder by mistake. So right now
> you can basically use 'false' as that flag.

There are three ordering guarantees:

1. - nothing (for devices with write cache without cache control)

2. - non-cached ordering: the sequence [submit req a, end req a, submit 
req b, end req b] guarantees the ordering. When a request ends 
successfully, it is guaranteed to be on the medium. This is what all the 
filesystems, md and dm assume about disks. This consistency model was in 
use long before barriers came in.

3. - barrier ordering: ordering is done with barriers; [submit req a, end 
req a, submit req b, end req b] won't guarantee the ordering of a and b, 
so a barrier must be inserted.

So you can make two bit flags that differentiate these models. In the 
current kernel, models (1) and (2) cannot be differentiated in any way. (3) 
can be detected only after a trial write, and even that won't guarantee 
that (3) will remain valid later.
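
A minimal user-space sketch of how two such bit flags could encode the 
three models. The names ORDER_ON_COMPLETION, ORDER_BY_BARRIER, 
pick_commit_method and needs_fsck_after_crash are hypothetical and are not 
existing kernel identifiers; preferring barriers when both flags are set is 
an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical flag names for illustration; not existing kernel flags. */
enum ordering_flags {
    ORDER_ON_COMPLETION = 1u << 0, /* model 2: a completed write is on the medium */
    ORDER_BY_BARRIER    = 1u << 1, /* model 3: barrier requests are honored */
};                                 /* model 1: neither flag is set */

enum commit_method { COMMIT_UNSAFE, COMMIT_WAIT, COMMIT_BARRIER };

/* Choose how a journal commit would be ordered on a device with these flags. */
static enum commit_method pick_commit_method(unsigned int flags)
{
    if (flags & ORDER_BY_BARRIER)
        return COMMIT_BARRIER;   /* insert a barrier between a and b */
    if (flags & ORDER_ON_COMPLETION)
        return COMMIT_WAIT;      /* wait for a to end before submitting b */
    return COMMIT_UNSAFE;        /* no ordering guarantee is available */
}

/* With no guarantee at all, the filesystem should be marked for fsck. */
static bool needs_fsck_after_crash(unsigned int flags)
{
    return (flags & (ORDER_ON_COMPLETION | ORDER_BY_BARRIER)) == 0;
}
```

With flags like these, a device in model (2) would no longer be confused 
with one in model (1), and no trial write would be needed to find out.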

> > The reasoning: "write barriers aren't supported => the device doesn't 
> > guarantee consistency" isn't valid.
> 
> It's valid in the sense that it's the only RELIABLE primitive we have.
> Are you really suggesting that we just assume any device is fully
> ordered, unless proven otherwise?

If someone implements "write barriers aren't supported => run fsck", then 
a lot of systems will start fscking needlessly (for example those using md 
or dm without write cache) and become inoperable for a long time because 
of that. So no one can really implement this logic, and filesystems don't 
run fsck at all when operated over a device that doesn't support ordering. 
So you get data corruption if you get a crash on those devices.

> > > The error handling is complex, no doubt
> > > about that. But the trial barrier test is pretty trivial and even could
> > > be easily abstracted out. If a later barrier write fails, then that's
> > > really no different than if a normal write fails. Error handling is not
> > > easy in that case.
> > 
> > I had a discussion with Andi about it some time ago. The conclusion was 
> > that all the current filesystems handle barriers failing in the middle of 
> > the operation without functionality loss, but it makes barriers useless 
> > for any performance-sensitive tasks (commits that wouldn't block 
> > concurrent activity). Non-blocking commits could only be implemented if 
> > barriers don't fail.
> 
> As long as you do a trial barrier like XFS does, barriers will not fail
> unless you have media error.

No.

The barrier can be cancelled with -EOPNOTSUPP at any time. Andi Kleen 
submitted a patch that implements failing barriers for device mapper and 
he says that md-raid1 does the same thing.

Filesystems handle these randomly failed barriers, but the downside is that 
they must not submit any request concurrently with the barrier. Also, that 
-EOPNOTSUPP restarting code is really crap: the request cannot be 
restarted from bi_end_io, so bi_end_io needs to hand it off to another 
thread to retry it without the barrier.
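
A rough user-space model of that retry path, only to show its shape. 
struct request, struct retry_queue, end_io, retry_worker and fake_submit 
are all hypothetical names; this is not kernel code and it ignores locking.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio; not the kernel structure. */
struct request {
    bool barrier;          /* submitted as a barrier? */
    int status;            /* last completion status */
    struct request *next;  /* link in the retry queue */
};

struct retry_queue { struct request *head; };

/* Fake device for the sketch: it rejects every barrier request. */
static int fake_submit(struct request *rq)
{
    return rq->barrier ? -EOPNOTSUPP : 0;
}

/* Completion handler: it must not resubmit, so it only defers the request. */
static void end_io(struct retry_queue *q, struct request *rq, int error)
{
    rq->status = error;
    if (error == -EOPNOTSUPP && rq->barrier) {
        rq->next = q->head;
        q->head = rq;
    }
}

/* Runs in another thread: resubmit deferred requests without the barrier. */
static size_t retry_worker(struct retry_queue *q, int (*submit)(struct request *))
{
    size_t retried = 0;
    while (q->head) {
        struct request *rq = q->head;
        q->head = rq->next;
        rq->next = NULL;
        rq->barrier = false;
        rq->status = submit(rq);
        retried++;
    }
    return retried;
}
```

Even in this toy form, the filesystem needs a second commit path plus a 
hand-off between contexts just to cope with a barrier being refused.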

See this patch: http://lkml.org/lkml/2008/12/4/433 (and associated thread)
The patch is silly, but it shows what is really happening and what the 
filesystem must be prepared to deal with.

> Things would also be much easier, if writes never failed.
>
> -- 
> Jens Axboe

I definitely agree that it shouldn't fail. So remove that -EOPNOTSUPP 
error code altogether, make barriers always pass to all kinds of devices 
and inform the caller via queue flags that the device doesn't support 
ordering.
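
As an illustration of how such flags might be propagated up through md and 
dm, here is a small sketch. The flag names and stacked_ordering_flags are 
hypothetical, and taking the intersection of the members' flags is a 
conservative assumption of this sketch, not existing kernel behaviour.

```c
#include <stddef.h>

/* Hypothetical flag values for illustration; not existing queue flags. */
#define ORDER_ON_COMPLETION (1u << 0) /* a completed write is on the medium */
#define ORDER_BY_BARRIER    (1u << 1) /* barrier requests are honored */

/*
 * A stacked device only advertises a guarantee that every member gives.
 * With no members (n == 0) this returns both flags, so callers should
 * only use it for a non-empty member list.
 */
static unsigned int stacked_ordering_flags(const unsigned int *member_flags,
                                           size_t n)
{
    unsigned int flags = ORDER_ON_COMPLETION | ORDER_BY_BARRIER;

    for (size_t i = 0; i < n; i++)
        flags &= member_flags[i];
    return flags;
}
```

A filesystem on top would then read the resulting flags once and mark 
itself for fsck if none are set, instead of probing with a trial barrier.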

Mikulas