After updating to:

FreeBSD 11.0-CURRENT #4 r296808: Sun Mar 13 22:39:59 UTC 2016

My backup disk:

da0 at umass-sim0 bus 0 scbus13 target 0 lun 0
da0: <WD My Passport 0827 1012> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number 57584431453235394C584137
da0: 400.000MB/s transfers
da0: 2861556MB (732558336 4096 byte sectors)
da0: quirks=0x2<NO_6_BYTE>

is effectively unavailable because running "fsck_ffs /dev/da0" yields:

usb_pc_common_mem_cb: Page offset was not preserved
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 04 66 2c 30 00 00 10 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
usb_pc_common_mem_cb: Page offset was not preserved
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 04 66 2c 30 00 00 10 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
[...]

Previous version was:

FreeBSD 11.0-CURRENT #3 r290949: Tue Nov 17 02:13:28 UTC 2015

That worked OK.

Problem also reproduced with a similar 2TB USB disk.

Also tried plugging the disk into a USB2 port, same thing.
Addendum: On my laptop, running: 11.0-CURRENT FreeBSD 11.0-CURRENT #32 r296137: Sat Feb 27 11:34:01 UTC 2016 I don't see this problem.
Hi PHK, Looks like someone has changed some bits and pieces in busdma (again). Would you mind narrowing down the exact version causing the breakage? I'll have a look at the commits in the range you've given. --HPS
It is a production box and I can only mess with it on weekends, so that's going to take forever... I think the best bet is trying to reproduce on another machine
FYI: The SCSI command is an 8Kbyte request. I'll try and see if this is reproducible with "dd". --HPS
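For readers following the CDB bytes in the log, here is a minimal sketch of how a READ(10) command decodes. The struct and function names are hypothetical, made up for illustration; only the field layout (opcode 0x28, big-endian LBA in bytes 2..5, big-endian transfer length in bytes 7..8) comes from the SCSI command format. The transfer length is counted in logical blocks, so the byte size depends on the sector size the device reports.

```c
#include <stdint.h>

/* Hypothetical container for the two READ(10) fields of interest. */
struct sketch_read10 {
	uint32_t lba;		/* CDB bytes 2..5, big-endian */
	uint16_t nblocks;	/* CDB bytes 7..8, big-endian */
};

/* Decode a 10-byte READ(10) CDB; returns 0 on success, -1 if not READ(10). */
static int
sketch_decode_read10(const uint8_t cdb[10], struct sketch_read10 *out)
{
	if (cdb[0] != 0x28)
		return (-1);
	out->lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
	    ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5];
	out->nblocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
	return (0);
}

/* Request size in bytes for a given logical sector size. */
static uint32_t
sketch_read10_bytes(const struct sketch_read10 *r, uint32_t sector_size)
{
	return ((uint32_t)r->nblocks * sector_size);
}
```

Fed the CDB from the report (28 00 04 66 2c 30 00 00 10 00), this yields a transfer length of 16 blocks. That is 8192 bytes at 512 bytes per sector, and 65536 bytes at the 4096-byte sectors the da0 probe lines report.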
My best guess is that the filesystem was created along these lines: newfs -O2 -U -b 65536 -f 8192 -i something
Does the system have more than 4 GB of RAM, so that bounce pages will be used for BUSDMA?
Dmesg:

CPU: AMD Athlon(tm) II X3 455 Processor (3311.18-MHz K8-class CPU)
  Origin="AuthenticAMD" Id=0x100f53 Family=0x10 Model=0x5 Stepping=3
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x837ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>
  SVM: NP,NRIP,NAsids=64
  TSC: P-state invariant
real memory = 17179869184 (16384 MB)
avail memory = 16573935616 (15806 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <090712 APIC1033>
FreeBSD/SMP: Multiprocessor System Detected: 3 CPUs
FreeBSD/SMP: 1 package(s) x 3 core(s)
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
 cpu2 (AP): APIC ID: 2
ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 64/32 (20150818/tbfadt-649)
ioapic0 <Version 2.1> irqs 0-23 on motherboard
ioapic1 <Version 2.1> irqs 24-55 on motherboard
Hi, This issue is not USB related. My best guess is to try to revert r292255 and see if it makes any difference: https://svnweb.freebsd.org/changeset/base/292255 --HPS
I can probably try that over the weekend.
I'm able to reproduce the problem using the following kernel patch and test-program on the latest and greatest 11-current:

diff --git a/sys/x86/x86/busdma_machdep.c b/sys/x86/x86/busdma_machdep.c
index c96b26d..a4c0af7 100644
--- a/sys/x86/x86/busdma_machdep.c
+++ b/sys/x86/x86/busdma_machdep.c
@@ -97,6 +97,13 @@ bus_dma_run_filter(struct bus_dma_tag_common *tc, bus_addr_t paddr)
 {
 	int retval;
 
+	if (tc->flags & BUS_DMA_KEEP_PG_OFFSET) {
+		if ((paddr % 3) == 0) {
+			printf("F");
+			return (1);
+		}
+	}
+
 	retval = 0;
 	do {
 		if (((paddr > tc->lowaddr && paddr <= tc->highaddr) ||

cat << EOF > test.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

int
main()
{
	int fd = open("/dev/da0", O_RDONLY);
	char buffer[65536];
	int y;

	if (fd < 0)
		return (0);
	y = 65536;
	while (1) {
		(void) malloc(512);
		if (lseek(fd, 0, 0) < 0)
			printf("ERROR seek\n");
		if (read(fd, buffer, y) != y)
			printf("ERROR read\n");
	}
	close(fd);
	return (0);
}
EOF

Memory stick is connected through USB 3.0 port XHCI.

It appears that when bounce pages are mixed with non-bounce pages the problem happens.
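To see why the injected `(paddr % 3) == 0` condition produces a mix of bounced and non-bounced pages, here is a small user-space sketch. It copies only the injected test from the patch; the function names and the 4096-byte page size are assumptions for illustration, and the real filter sees whatever physical addresses busdma passes in, which need not be page-aligned.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Same condition as the injected test: force a bounce for this address? */
static int
sketch_injected_filter(uint64_t paddr)
{
	return ((paddr % 3) == 0);
}

/*
 * Count how many of npages consecutive page-aligned addresses,
 * starting at base, the injected condition would force to bounce.
 */
static unsigned
sketch_count_forced_bounces(uint64_t base, unsigned npages)
{
	unsigned i, n = 0;

	for (i = 0; i < npages; i++) {
		if (sketch_injected_filter(base + (uint64_t)i * SKETCH_PAGE_SIZE))
			n++;
	}
	return (n);
}
```

Because 4096 leaves remainder 1 when divided by 3, consecutive page-aligned addresses cycle through remainders 0, 1, 2, so roughly every third page of a physically contiguous range is forced through a bounce page while its neighbours are mapped directly. Under that assumption a 16-page (64 KB) transfer would contain 5 or 6 bounced pages interleaved with direct ones.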
I just reverted r292255 and the problem does not seem to trigger any more. Handing issue over to royger @ freebsd . org --HPS
Created attachment 168783: Initial patch

Hello, Could you please test the attached patch? It should solve the issue, although I would like to hear opinions about it.
Hi, I'll give the patch a spin later today. --HPS
Comments for patch: I think the "goto again" is not needed. Bouncing from first bounce page and on should be sufficient? --HPS
More comments: And only iff "vaddr & (PAGE_SIZE-1)" != 0, you need to set the bounce boolean - right, because then you are shifting the data with regard to the page offset. --HPS
(In reply to Hans Petter Selasky from comment #15)

Yes, there are several optimizations that can be applied in order to improve the patch. As I've noted in the comment, this is just an initial approach to make sure the problem is caused by not bouncing the whole region (and then the offsets don't match anymore).

Regarding the "again" label, I'm not really sure we can get rid of it. For the mapped case (_bus_dmamap_count_pages) it looks like we can remove it, but for the other cases it depends on whether maxsegsz is a multiple of PAGE_SIZE; otherwise we might end up with segments that don't have consecutive offsets, AFAICT.
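As a rough illustration of what "offsets don't match" means here, the sketch below checks that every DMA segment starts at the same offset within its page as the corresponding position in the virtual buffer. This is an assumption about the property the "Page offset was not preserved" message guards, not a copy of the kernel code; the struct, the function name, and the 4096-byte page size are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u			/* assumed page size */
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

/* Hypothetical stand-in for one busdma segment. */
struct sketch_seg {
	uint64_t paddr;	/* bus address of the segment */
	size_t	 len;	/* segment length in bytes */
};

/*
 * Walk the segment list alongside the virtual buffer and return the
 * index of the first segment whose in-page offset differs from that of
 * the buffer position it maps, or -1 if every offset is preserved.
 */
static int
sketch_first_offset_mismatch(uintptr_t vaddr, const struct sketch_seg *segs,
    int nsegs)
{
	int i;

	for (i = 0; i < nsegs; i++) {
		if ((vaddr & SKETCH_PAGE_MASK) !=
		    (segs[i].paddr & SKETCH_PAGE_MASK))
			return (i);
		vaddr += segs[i].len;
	}
	return (-1);
}
```

With a buffer at in-page offset 0x200, a first segment at bus offset 0x200 that runs to the end of its page passes. A following segment that also starts at offset 0x200 fails, because the buffer has by then reached a page boundary. That is the kind of shift that can appear when only part of a region is bounced.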
I'm sort of confused about why we would bounce in the first place. This is amd64 with a USB3 host controller and device; why would bouncing be necessary?
PHK: If you check with usbconfig that the disk is attached to XHCI and not EHCI, then no bouncing will happen. XHCI is 64-bit while EHCI is 32-bit. There are a few XHCI quirks for 32-bit. See xhci_pci.c.

PHK: If your disk was attached to XHCI, then there might be more bugs in the segment list computations, which make them discontiguous.
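The 32-bit versus 64-bit distinction can be sketched as a simple address-limit test, in the spirit of the `paddr > tc->lowaddr` comparison visible in bus_dma_run_filter. The helper names below are hypothetical, and the sketch only models the address limit; it ignores alignment, boundary, and filter callbacks that real busdma tags may also carry.

```c
#include <stdint.h>

/* Highest bus address a device with dma_bits address bits can reach. */
static uint64_t
sketch_dma_lowaddr(unsigned dma_bits)
{
	if (dma_bits >= 64)
		return (UINT64_MAX);
	return ((UINT64_C(1) << dma_bits) - 1);
}

/* Nonzero if a page at paddr is out of reach and must be bounced. */
static int
sketch_needs_bounce(uint64_t paddr, unsigned dma_bits)
{
	return (paddr > sketch_dma_lowaddr(dma_bits));
}
```

On a machine with 16 GB of RAM, any buffer page located at or above the 4 GB mark would need bouncing for a controller limited to 32-bit DMA, while a 64-bit capable controller could reach it directly. Whether a given transfer bounces therefore depends on where its pages happen to sit in physical memory.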
Ohh, I guess it's one of the quirked controllers. Then it makes sense.

usbconfig says:

...
ugen0.2: <My Passport 0827 Western Digital> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)

That alone would say XHCI to me.

dmesg further has:

xhci0: <NEC uPD720200 USB 3.0 controller> mem 0xfe7fa000-0xfe7fbfff irq 50 at device 0.0 on pci4
xhci0: 32 bytes context size, 32-bit DMA
xhci0: Unable to map MSI-X table
usbus0 on xhci0
...
usbus0: 5.0Gbps Super Speed USB v3.0
...
ugen0.2: <Western Digital> at usbus0
umass0: <Western Digital My Passport 0827, class 0/0, rev 3.00/10.12, addr 1> on usbus0
umass0: SCSI over Bulk-Only; quirks = 0x8000
With the attached patch I can no longer reproduce the fault. Maybe others subscribed here could try to reproduce with and without the patch as well. Thank you! --HPS
I updated the machine to vanilla: 11.0-CURRENT #0 r297514: Sat Apr 2 21:19:38 UTC 2016 And the problem seems to have gone away.
PHK: I doubt the problem has gone away unless a fix has been committed. Roger: Did you commit a fix in 11-current? --HPS
As do I, but I don't see it any more. I guess it may depend on VM fragmentation, so the box simply hasn't run long enough to cause it.
And now the problem is back, so it is still not fixed in trunk. Will try patch.
Running with the patch and after 8 hours of various pounding things are still working.
I now have an uptime of three days, and still no signs of trouble. Looks like the patch solves the issue.
*** Bug 208668 has been marked as a duplicate of this bug. ***
I've reverted the offending commit today, could you give HEAD a try? Thanks.
Was the offending commit https://svnweb.freebsd.org/base?view=revision&revision=292255 ? I ask because in my case the trouble began with an earlier FreeBSD-based PC-BSD, 11.0-CURRENTDEC2015 http://lists.pcbsd.org/pipermail/testing/2015-December/010270.html
(In reply to Graham Perrin from comment #29) Yes, the offending commit in this case was r292255. 11.0-CURRENTDEC2015 is AFAICT from a source snapshot of Dec 5 or earlier, which means it doesn't contain r292255. In any case, could you give current HEAD a try?
I have just booted: FreeBSD 11.0-CURRENT #1 r298283M Will report back.
No problems so far.
(In reply to comment #30)
> … could you give current HEAD a try? Thanks …

I'll try PC-BSD 11.0-CURRENTMAY2016, when it arrives. (Nothing sooner, sorry; I'm dealing with bereavement.)
still on r298283M, 15 days uptime, no trouble seen.
I reused the PersonaCrypt device with PC-BSD 11.0-CURRENTMAY2016 on two notebooks with 4 GB memory: an Ergo Vista 621 (circa 2007) and a MacBookPro8,2 (15-inch, early 2011).

I see occasional 'CAM status: CCB request completed with an error' (examples below) but happily, with this month's CURRENT, neither machine grinds to a halt.

More specifically, with reference to bug report 208668:

* I no longer see 'Error 5, Retries exhausted'

$ date ; uptime ; freebsd-version ; uname -a
18 May 2016 at 19:20:43 UTC
 7:20p.m. up 1:39, 1 users, load averages: 1.52, 1.88, 1.77
11.0-CURRENTMAY2016
FreeBSD cces3-gjp4-pcbsd-macbookpro82 11.0-CURRENTMAY2016 FreeBSD 11.0-CURRENTMAY2016 #19 5bab0d2(master): Fri May 6 17:56:25 UTC 2016 root@devastator:/usr/obj/usr/src/sys/GENERIC amd64
$ sudo dmesg | grep umass
umass0: <JetFlash TS256MJF2A, class 0/0, rev 1.10/1.00, addr 6> on usbus3
umass0: SCSI over Bulk-Only; quirks = 0xc100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
umass0: at uhub7, port 2, addr 6 (disconnected)
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
(da0:umass-sim0:0:0:0): Periph destroyed
umass0: <Kingston DataTraveler 3.0, class 0/0, rev 2.10/1.00, addr 4> on usbus1
umass0: SCSI over Bulk-Only; quirks = 0x8100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
umass0: <Kingston DataTraveler 3.0, class 0/0, rev 2.10/1.00, addr 7> on usbus1
umass0: SCSI over Bulk-Only; quirks = 0x8100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 f4 97 c2 00 00 07 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 f4 97 c2 00 00 07 00
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
$

(In reply to Graham Perrin from comment #29)
> … in my case the trouble began with an
> earlier FreeBSD-based PC-BSD, 11.0-CURRENTDEC2015
> http://lists.pcbsd.org/pipermail/testing/2015-December/010270.html

That remains a puzzle but in my opinion, not enough of a puzzle for this bug report 208365 to remain open.

Thanks
It appears the problem was resolved around 2016-05-18.