<div dir="ltr"><br><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Feb 23, 2017 at 5:52 PM, Alex Williamson <span dir="ltr"><<a target="_blank" href="mailto:alex.williamson@redhat.com">alex.williamson@redhat.com</a>></span> wrote:<br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span class="gmail-m_-3793159581251739544gmail-">On Thu, 23 Feb 2017 13:15:54 +0000<br>
Ingrid Ribeiro Galvez <<a target="_blank" href="mailto:inrigalvez@gmail.com">inrigalvez@gmail.com</a>> wrote:<br>
<br>
> Hi guys,<br>
><br>
> I've been working with QEMU/KVM for a while and now I need to pass through<br>
> PCI devices. I followed all the required steps to make this work: enabled<br>
> the IOMMU, loaded the vfio module, bound the device to vfio, and checked that the vfio<br>
> group was indeed created. But when I start QEMU with any PCI device,<br>
> I get the error message:<br>
><br>
</span>> *vfio: Failed to read device config space*<br>
<br>
This comes from here:<br>
<br>
/* Get a copy of config space */<br>
ret = pread(vdev->vbasedev.fd, vdev->pdev.config,<br>
MIN(pci_config_size(&vdev->pdev), vdev->config_size),<br>
vdev->config_offset);<br>
if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {<br>
ret = ret < 0 ? -errno : -EFAULT;<br>
error_setg_errno(errp, -ret, "failed to read device config space");<br>
goto error;<br>
}<br>
<br>
So we got fewer bytes than expected and an errno. What does the device<br>
look like on the host (lspci -vvv)? Can you read the full config<br>
space for the device from sysfs<br>
(xxd /sys/bus/pci/devices/0000:01:00.0/config)?<br>
<span class="gmail-m_-3793159581251739544gmail-"><br></span></blockquote><div> </div><div>This is the lspci -vvv on the device:<br><br><span style="font-family:monospace,monospace">01:00.0 Ethernet controller: Intel Corporation Device 157b (rev 03)<br> Subsystem: Intel Corporation Device 0000<br> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+<br> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-<br> Interrupt: pin A routed to IRQ 16<br> Region 0: Memory at dfb00000 (32-bit, non-prefetchable) [disabled] [size=128K]<br> Region 2: I/O ports at e000 [disabled] [size=32]<br> Region 3: Memory at dfb20000 (32-bit, non-prefetchable) [disabled] [size=16K]<br> Capabilities: [40] Power Management version 3<br> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+<wbr>)<br> Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-<br> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+<br> Address: 0000000000000000 Data: 0000<br> Masking: 00000000 Pending: 00000000<br> Capabilities: [70] MSI-X: Enable- Count=5 Masked-<br> Vector table: BAR=3 offset=00000000<br> PBA: BAR=3 offset=00002000<br> Capabilities: [a0] Express (v2) Endpoint, MSI 00<br> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us<br> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+<br> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-<br> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-<br> MaxPayload 256 bytes, MaxReadReq 512 bytes<br> DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-<br> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <2us, L1 <16us<br> ClockPM- Surprise- LLActRep- BwNot-<br> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+<br> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-<br> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-<br> DevCap2: 
Completion Timeout: Range ABCD, TimeoutDis+<br> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-<br> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB<br> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-<br> Compliance De-emphasis: -6dB<br> LnkSta2: Current De-emphasis Level: -3.5dB<br> Capabilities: [100 v2] Advanced Error Reporting<br> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-<br> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-<br> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-<br> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-<br> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+<br> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-<br> Capabilities: [140 v1] Device Serial Number 00-13-f2-ff-ff-a0-01-60<br> Capabilities: [1a0 v1] #17<br> Kernel driver in use: vfio-pci</span><br><br></div><div>And this is the config space I get from sysfs:<br><br><span style="font-family:monospace,monospace">[root@r6 /]# hexdump /sys/bus/pci/devices/0000\:01\:00.0/config <br>0000000 8086 157b 0407 0010 0003 0200 0000 0000<br>0000010 0000 dfb0 0000 0000 e001 0000 0000 dfb2<br>0000020 0000 0000 0000 0000 0000 0000 8086 0000<br>0000030 0000 0000 0040 0000 0000 0000 010b 0000<br>0000040 5001 c823 2008 0000 0000 0000 0000 0000<br>0000050 7005 0180 0000 0000 0000 0000 0000 0000<br>0000060 0000 0000 0000 0000 0000 0000 0000 0000<br>0000070 a011 8004 0003 0000 2003 0000 0000 0000<br>0000080 0000 0000 0000 0000 0000 0000 0000 0000<br>0000090 0000 0000 0000 0000 0000 0000 ffff ffff<br>00000a0 0010 0002 8cc2 1000 283f 0019 5c11 0042<br>00000b0 0040 1011 0000 0000 0000 0000 0000 0000<br>00000c0 0000 0000 001f 0000 0000 0000 0000 0000<br>00000d0 0001 0001 0000 0000 0000 0000 0000 0000<br>00000e0 0000 0000 0000 0000 
0000 0000 0000 0000<br>*<br>0000100 0001 1402 0000 0000 0000 0000 2031 0046<br>0000110 0000 0000 2000 0000 00a0 0000 0000 0000<br>0000120 0000 0000 0000 0000 0000 0000 0000 0000<br>*<br>0000140 0003 1a01 0160 ffa0 f2ff 0013 0000 0000<br>0000150 0000 0000 0000 0000 0000 0000 0000 0000<br>*<br>00001a0 0017 0001 0205 0007 0000 0000 0000 0000<br>00001b0 0000 0000 0000 0000 0000 0000 0000 0000<br>*<br>0001000</span><br><br></div><div> </div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span class="gmail-m_-3793159581251739544gmail-">
> By looking into qemu code I found out that the error was coming from a call<br>
> to pread to read the pci device's file descriptor. It fails with errno<br>
</span>> '*Illegal<br>
> seek*'. Offset being used is 0x70000000000, and this offset seems to be the<br>
<span class="gmail-m_-3793159581251739544gmail-">> same for all devices and also in different machines. I also wrote some code<br>
> to test reading the pci device file descriptor from outside of the qemu<br>
> code and the pread also fails with 'illegal seek' error. This was done on a<br>
> generic linux kernel v4.7.8 compiled with uClibc for an embedded system.<br>
<br>
</span>The offset for each standard region of the device is fixed; PCI config<br>
space is always exposed at the same offset.<br>
<span class="gmail-m_-3793159581251739544gmail-"><br>
> If I install ubuntu 16.04 (kernel v4.4.0) on the same machine and repeat<br>
> the steps, pci passthrough works fine and the pread on my test code also<br>
> works perfectly.<br>
><br>
> This is the code I am using to test reading the device fd with pread:<br>
><br>
><br>
> #include <unistd.h><br>
> #include <stdio.h><br>
> #include <errno.h><br>
> #include <fcntl.h><br>
> #include <linux/vfio.h><br>
> #include <sys/ioctl.h><br>
> #include <sys/mman.h><br>
><br>
> #define BUF_SIZE 4096<br>
<br>
</span>This presumes the device has a full PCIe config space. Is the above<br>
sysfs file 4k in size?<br>
<div><div class="gmail-m_-3793159581251739544gmail-h5"><br></div></div></blockquote><div>I used this buffer size because it is what qemu was using. And it works fine on Ubuntu.<br><br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><div><div class="gmail-m_-3793159581251739544gmail-h5">
> int main(){<br>
> char buf[BUF_SIZE], buf1[BUF_SIZE], buf2[BUF_SIZE];<br>
><br>
> int ret,group_fd, fd, fd2;<br>
> size_t nbytes = BUF_SIZE;<br>
> ssize_t bytes_read;<br>
> int iommu1, iommu2;<br>
> unsigned long offset;<br>
> int container, group, device, i;<br>
> struct vfio_group_status group_status = { .argsz = sizeof(group_status)<br>
> };<br>
> struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info)<br>
> };<br>
> struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };<br>
> struct vfio_device_info device_info = { .argsz = sizeof(device_info) };<br>
> struct vfio_region_info reg = { .argsz = sizeof(reg) };<br>
><br>
> container = open("/dev/vfio/vfio",O_RDWR);<br>
> printf("Container = %d\n",container);<br>
> if(ioctl(container,VFIO_GET_API_VERSION)!=VFIO_API_VERSION){<br>
> printf("Unknown api version: %m\n");<br>
> }<br>
> group_fd = open("/dev/vfio/1",O_RDWR);<br>
> printf("Group fd = %d\n", group_fd);<br>
> ioctl(group_fd, VFIO_GROUP_GET_STATUS, &group_status);<br>
> if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)){<br>
> printf("Group not viable\n");<br>
> getchar();<br>
> return 1;<br>
> }<br>
> ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER,&container);<br>
> ret = ioctl(container,VFIO_SET_IOMMU,VFIO_TYPE1_IOMMU);<br>
><br>
> ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);<br>
><br>
> /* Allocate some space and setup a DMA mapping */<br>
> dma_map.vaddr = (unsigned long int) mmap(0, 1024 * 1024, PROT_READ |<br>
> PROT_WRITE,MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);<br>
> dma_map.size = 1024 * 1024;<br>
> dma_map.iova = 0; /* 1MB starting at 0x0 from device view */<br>
> dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;<br>
><br>
> ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);<br>
><br>
> printf("\n\nGETTING DEVICE FD\n");<br>
> fd = ioctl(group_fd,VFIO_GROUP_GET_DEVICE_FD,"0000:01:00.0");<br>
><br>
><br>
> ioctl(fd, VFIO_DEVICE_GET_INFO, &device_info);<br>
> for (i = 0; i < device_info.num_regions; i++) {<br>
> reg.index = i;<br>
><br>
> ioctl(fd, VFIO_DEVICE_GET_REGION_INFO, &amp;reg);<br>
><br>
> /* Setup mappings... read/write offsets, mmaps<br>
> * For PCI devices, config space is a region */<br>
> }<br>
><br>
> for (i = 0; i < device_info.num_irqs; i++) {<br>
> struct vfio_irq_info irq = { .argsz = sizeof(irq) };<br>
><br>
> irq.index = i;<br>
><br>
> ioctl(fd, VFIO_DEVICE_GET_IRQ_INFO, &irq);<br>
><br>
> }<br>
><br>
><br>
> reg.index = VFIO_PCI_CONFIG_REGION_INDEX;<br>
><br>
> printf("VFIO_DEVICE_GET_REGION_INFO = %lu",VFIO_DEVICE_GET_REGION_INFO);<br>
> ret = ioctl(fd, VFIO_DEVICE_GET_REGION_INFO, &amp;reg);<br>
><br>
> offset = reg.offset;<br>
> printf("offset is %lx\n",offset);<br>
> /*ret = read(group_fd,buf,nbytes);<br>
> printf("Read from group fd, ret is %d: %m\n",ret);<br>
> printf("CONFIG SPACE: \n");<br>
> printf("%s\n",buf);*/<br>
> printf("Fd = %d\n",fd);<br>
><br>
> //printf("VFIO_GROUP_GET_DEV_ID = %lu\n",VFIO_GROUP_GET_DEVICE_FD);<br>
> ret = read(fd,buf,nbytes);<br>
<br>
</div></div>This reads from offset 0, which is BAR0, which is possibly not enabled<br>
since you haven't enabled I/O or MMIO access to the device in the PCI<br>
COMMAND register in config space. Results here are going to depend on<br>
the state of the device as you receive it, and whether you can even<br>
read 4K from BAR0 space.<br></blockquote><div><br></div><div>How do I enable that? XD <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">
<span class="gmail-m_-3793159581251739544gmail-"><br>
> printf("Ret from read is = %d, buf = %s\n",ret,buf);<br>
> if(ret<1){<br>
> printf("ERROR: %m \n");<br>
> }<br>
><br>
> ret = pread(fd,buf,nbytes,offset);<br>
<br>
</span>This one should actually read from config space.<br>
<span class="gmail-m_-3793159581251739544gmail-"><br>
> printf("Ret from pread is = %d\n",ret);<br>
> if(ret<1){<br>
> printf("ERROR: %m \n");<br>
> }<br>
<br>
</span>So this is where you get an ESPIPE error? Do different sizes work?<br>
256 bytes? 64 bytes?<br></blockquote><div><br></div><div>No, return of pread is always -1 regardless of the buffer size =/ ... <br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">
<span class="gmail-m_-3793159581251739544gmail-"><br>
> printf("TESTING PREAD ON A COMMON FILE\n");<br>
> fd2 = open("/sys/bus/pci/devices/0000:01:00.0/device",O_RDONLY);<br>
> printf("FD2 = %d\n",fd2);<br>
> ret = read(fd2,buf1,nbytes);<br>
> if(ret<0){<br>
> printf("ERROR: %m\n");<br>
> }<br>
> printf("Result from read: ret = %d, content = %s\n",ret,buf1);<br>
> ret = pread(fd2,buf2,nbytes,2);<br>
> if(ret<0){<br>
> printf("ERROR: %m\n");<br>
> }<br>
> printf("Result from pread: ret = %d, content = %s\n",ret,buf2);<br>
<br>
</span>Did these work?<br></blockquote><div><br></div><div>Yes, pread on 'normal' files is working without any problems.<br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">
<span class="gmail-m_-3793159581251739544gmail-"><br>
> close(fd2);<br>
> getchar();<br>
> close(fd);<br>
> close(container);<br>
> close(group_fd);<br>
> return 0;<br>
> }<br>
><br>
><br>
> Something weird I noticed that might be related to this is that on Ubuntu<br>
> the iommu groups for some devices are very different from the manually<br>
> compiled kernel. There are a few devices that on Ubuntu have a large<br>
> iommu_group while in the generic kernel the iommu group is composed of only<br>
> one device (and this is on the same machine, btw!). Is this normal?<br>
> Another thing I tried was using 0 as the offset to pread, and this gives me the<br>
> same error even though a normal read works fine.<br>
<br>
</span>The Ubuntu kernel is older; perhaps it doesn't include quirks to enable<br>
ACS equivalent isolation on the PCH root ports. That would explain the<br>
group differences. Thanks,<br>
<br>
Alex<br></blockquote><div><br>Please let me know if there is more information I can provide.<br></div><div>Thanks very much!<br><br></div><div>Cheers,<br><br></div><div>Ingrid<br></div></div><br></div></div>
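<div>Two points from the thread can be sketched in a few lines of C: why the config region offset is 0x70000000000 on every device, and which COMMAND register bits the "enable I/O or MMIO access" remark refers to. This is only an illustration under stated assumptions. The shift of 40 and the config region index of 7 are assumed to match vfio-pci's internal layout, and every <code>sketch_</code> name is hypothetical; real code should keep using the <code>reg.offset</code> returned by VFIO_DEVICE_GET_REGION_INFO, as the test program above does.</div>

```c
#include <stdint.h>

/* Assumption: vfio-pci encodes the region index in the bits above bit 40
 * of the file offset. Userspace should not hard-code this; it is shown
 * only to explain the value printed by the test program. */
#define SKETCH_VFIO_PCI_OFFSET_SHIFT 40
#define SKETCH_VFIO_PCI_CONFIG_INDEX 7 /* assumed: BAR0-5 = 0..5, ROM = 6, config = 7 */

/* Standard PCI COMMAND register: 16 bits at config space offset 0x04. */
#define SKETCH_PCI_COMMAND        0x04
#define SKETCH_PCI_COMMAND_IO     0x0001 /* respond to I/O port accesses */
#define SKETCH_PCI_COMMAND_MEMORY 0x0002 /* respond to memory (MMIO) accesses */

/* File offset at which a given region index would be exposed. */
uint64_t sketch_region_offset(unsigned int index)
{
    return (uint64_t)index << SKETCH_VFIO_PCI_OFFSET_SHIFT;
}

/* Return the COMMAND value with I/O and memory decoding switched on,
 * leaving every other bit (e.g. INTx disable, 0x0400) untouched. */
uint16_t sketch_enable_decode(uint16_t command)
{
    return (uint16_t)(command | SKETCH_PCI_COMMAND_IO | SKETCH_PCI_COMMAND_MEMORY);
}
```

<div>With these assumptions, <code>sketch_region_offset(SKETCH_VFIO_PCI_CONFIG_INDEX)</code> yields 0x70000000000, the offset reported above. To actually enable decoding, the new COMMAND value would be written back with a 2-byte pwrite on the device fd at <code>reg.offset + SKETCH_PCI_COMMAND</code>, which presupposes that pread/pwrite on that fd work in the first place.</div>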