<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">Mr. Assche,<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">The problem with this is that the SCSI commands do not go far enough to actually address the needs of applications that need atomic updates. An application would "like" to be able to update a large, arbitrary set of sectors on a device with atomic semantics. The SCSI commands require the set to be contiguous. Application design starts to get interesting when the contiguous restriction goes away.<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">My initial thoughts are to tag multiple IO requests with an ID and a combined length field. This would be compatible with the SCSI spec if the request were contiguous, but nonsensical if the request were multi-segment. On the other hand, just hitting the SCSI spec is probably as simple as adding an "atomic" bit to the current structure so that IO pieces are not cut up. But then you don't address the multi-segment functionality that is possible. Regardless, there will be issues, as pieces of the current stack don't lend themselves well to propagating atomic operations up and down the stack. Just how do you split an atomic write across SCSI devices in a RAID set anyway?<br><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">What I am most interested in is keeping the stack working with at least 1:1 mapping layers, and with 1:many layers below the layer that implements the atomic functionality. Think of a dm-atomic.ko device that uses a log internally to implement multi-segment atomic writes. It can talk safely to RAID below it, but LVM should be able to sit above it and still have a file system that expects atomic functionality to work. 
Now getting a SAN connection to work like this would involve a new transport as iSCSI doesn't really have the semantics for this, but maybe there are some extra "transaction ID" bits that could be put into play (it has been a long time since I dug into the depths of the SCSI layers).<br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;font-size:small">Doug<br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jul 9, 2015 at 9:34 AM, Bart Van Assche <span dir="ltr"><<a href="mailto:bart.vanassche@sandisk.com" target="_blank">bart.vanassche@sandisk.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 07/09/2015 08:41 AM, Doug Dumitru wrote:> Mr. Hellwig,<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">
On Wed, Jul 8, 2015 at 12:38 PM, Christoph Hellwig <<a href="mailto:hch@infradead.org" target="_blank">hch@infradead.org</a><br></span><span class="">
<mailto:<a href="mailto:hch@infradead.org" target="_blank">hch@infradead.org</a>>> wrote:<br>
<br>
On Wed, Jul 08, 2015 at 09:21:21AM -0700, Doug Dumitru wrote:<br>
> I have a "smart" block device that can implement multi-segment atomic<br>
> writes.<br>
<br>
How about submitting your driver upstream first and then we can work<br>
with you on an API that fits the device's and the consumers' needs.<br>
<br></span><div><div class="h5">
I usually like to start with an interface and then implement the<br>
drivers from there.<br>
<br>
In this case, this is a block-level interface that supports new<br>
functionality (atomic writes). In the past, you would approach this<br>
type of problem by having the atomic and user layers as a monolithic<br>
solution. Consider database updates and the complexity that they go<br>
through to ensure database integrity. If a block device could provide a<br>
database with an atomic update interface, the database would get a lot<br>
simpler. The same discussion holds true for file systems. Depending on<br>
the atomic update implementation, you might end up in the same place in<br>
terms of total code, but you might also end up somewhere completely<br>
different.<br>
<br>
The impetus for this is some research on file system "write<br>
amplification". In general, file system design seems to be heading in<br>
the direction of higher and higher write amplification. For example,<br>
the tree structure of zfs is shockingly inefficient in terms of write<br>
overhead. This is happening at the same time as Flash is becoming<br>
popular but is also moving to smaller and smaller geometries. So write<br>
efficiency is becoming more and more important.<br>
<br>
Decoupling the atomic update semantics from file systems and other<br>
block device "users" gives devices the opportunity to implement<br>
atomic updates internal to or in cooperation with Flash management<br>
algorithms. In theory, you can implement atomic updates without any<br>
extra writes. In practice, some devices will be better than others.<br>
<br>
I was hoping to stumble across someone interested in this as a concept,<br>
or someone who has researched this area, as I don't have any existing<br>
code that is near production. I could pretty easily hack in a couple of<br>
extra fields in struct bio that would accomplish what I envision, but others<br>
might have differing input.<br>
</div></div></blockquote>
<br>
Hello Doug,<br>
<br>
When designing such an API, please try to stay close to the semantics of the already standardized SCSI commands. As you probably know, the Linux SCSI core has been implemented as a block driver. Any new command that is added to the Linux block layer has to be translated by the Linux SCSI core into a SCSI command. An example of a patch series that adds support for a new block layer primitive is the patch series that adds compare-and-write support (<a href="http://thread.gmane.org/gmane.linux.scsi/95869" rel="noreferrer" target="_blank">http://thread.gmane.org/gmane.linux.scsi/95869</a>). Although that patch series is not yet upstream, I think it is a good example of how to add new functionality to the block layer and SCSI core.<span class="HOEnZb"><font color="#888888"><br>
<br>
Bart.<br>
</font></span></blockquote></div><br><br clear="all"><br>-- <br><div class="gmail_signature">Doug Dumitru<br>EasyCo LLC<br></div>
</div>
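<div>The idea at the top of this message, tagging multiple IO requests with an ID and a combined length field, can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions: <code>struct atomic_tag</code>, <code>struct tagged_req</code>, and both helper functions are hypothetical names, not existing kernel structures or fields of struct bio. The first helper checks that every piece of a tagged group has arrived; the second checks whether the group happens to be one contiguous range, which is the only case the message above says the SCSI commands cover.</div>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag carried by each IO request; not a real kernel type. */
struct atomic_tag {
    uint64_t group_id;      /* shared by every request in one atomic update */
    uint32_t total_sectors; /* combined length of the whole group */
};

/* Hypothetical stand-in for one tagged request (one segment). */
struct tagged_req {
    struct atomic_tag tag;
    uint64_t sector;        /* first sector written by this segment */
    uint32_t nr_sectors;    /* length of this segment in sectors */
};

/*
 * Returns true when the requests tagged with group_id agree on the
 * combined length and their segment lengths add up to exactly that value.
 */
bool group_is_complete(const struct tagged_req *reqs, size_t n,
                       uint64_t group_id)
{
    uint64_t seen = 0;
    uint32_t expected = 0;
    bool found = false;

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].tag.group_id != group_id)
            continue;
        if (!found) {
            expected = reqs[i].tag.total_sectors;
            found = true;
        } else if (reqs[i].tag.total_sectors != expected) {
            return false; /* members disagree about the combined length */
        }
        seen += reqs[i].nr_sectors;
    }
    return found && seen == expected;
}

/*
 * Returns true when the group's segments, taken in submission order,
 * form a single contiguous sector range.
 */
bool group_is_contiguous(const struct tagged_req *reqs, size_t n,
                         uint64_t group_id)
{
    uint64_t next = 0;
    bool found = false;

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].tag.group_id != group_id)
            continue;
        if (found && reqs[i].sector != next)
            return false; /* gap or overlap: a multi-segment update */
        next = reqs[i].sector + reqs[i].nr_sectors;
        found = true;
    }
    return found;
}
```

<div>A layer such as the dm-atomic.ko device imagined above could, under these assumptions, hold a group until <code>group_is_complete()</code> is true and then either pass it down as a single contiguous write or fall back to its own log for the multi-segment case.</div>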