Bug 208365 - [umass] Cannot fsck 3TB USB disk
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Roger Pau Monné
Keywords: patch
Duplicates: 208668
Reported: 2016-03-28 18:01 UTC by Poul-Henning Kamp
Modified: 2017-12-01 00:11 UTC

Attachments
Initial patch (5.48 KB, patch)
2016-03-30 12:24 UTC, Roger Pau Monné

Description Poul-Henning Kamp freebsd_committer 2016-03-28 18:01:23 UTC
After updating to:

FreeBSD 11.0-CURRENT #4 r296808: Sun Mar 13 22:39:59 UTC 2016

My backup disk:

da0 at umass-sim0 bus 0 scbus13 target 0 lun 0
da0: <WD My Passport 0827 1012> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number 57584431453235394C584137
da0: 400.000MB/s transfers
da0: 2861556MB (732558336 4096 byte sectors)
da0: quirks=0x2<NO_6_BYTE>

Is effectively unavailable because running "fsck_ffs /dev/da0" yields:

usb_pc_common_mem_cb: Page offset was not preserved
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 04 66 2c 30 00 00 10 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
usb_pc_common_mem_cb: Page offset was not preserved
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 04 66 2c 30 00 00 10 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
[...]

Previous version was:

FreeBSD 11.0-CURRENT #3 r290949: Tue Nov 17 02:13:28 UTC 2015

That worked OK.

Problem also reproduced with a similar 2TB USB disk.

Also tried plugging the disk into a USB2 port, same thing.
Comment 1 Poul-Henning Kamp freebsd_committer 2016-03-28 18:07:35 UTC
Addendum:

On my laptop, running:

11.0-CURRENT FreeBSD 11.0-CURRENT #32 r296137: Sat Feb 27 11:34:01 UTC 2016

I don't see this problem.
Comment 2 Hans Petter Selasky freebsd_committer 2016-03-29 06:37:59 UTC
Hi PHK,

Looks like someone has changed some bits and pieces in busdma (again). Would you mind narrowing down the exact version causing the breakage? I'll have a look at the commits in the range you've given.

--HPS
Comment 3 Poul-Henning Kamp freebsd_committer 2016-03-29 06:56:26 UTC
It is a production box and I can only mess with it on weekends, so that's going to take forever...

I think the best bet is to try to reproduce it on another machine.
Comment 4 Hans Petter Selasky freebsd_committer 2016-03-29 08:35:57 UTC
FYI: The SCSI command is a 16-block request (8 Kbyte with 512-byte sectors, 64 Kbyte with the 4096-byte sectors this disk reports). I'll try and see if this is reproducible with "dd".

--HPS
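
For reference, a minimal sketch of how the CDB bytes in the log above decode, assuming the standard READ(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, big-endian transfer length in blocks in bytes 7-8). The function names are hypothetical and exist only for this illustration:

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not taken from the FreeBSD tree.
 * Assumed READ(10) layout: byte 0 = opcode (0x28), bytes 2-5 = LBA,
 * bytes 7-8 = transfer length in logical blocks, both big-endian.
 */
static uint32_t
sketch_read10_lba(const uint8_t cdb[10])
{
	return (((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
	    ((uint32_t)cdb[4] << 8) | (uint32_t)cdb[5]);
}

static uint16_t
sketch_read10_blocks(const uint8_t cdb[10])
{
	return ((uint16_t)(((uint16_t)cdb[7] << 8) | cdb[8]));
}

/* Request size in bytes for a given logical block (sector) size. */
static uint64_t
sketch_read10_bytes(const uint8_t cdb[10], uint32_t sector_size)
{
	return ((uint64_t)sketch_read10_blocks(cdb) * sector_size);
}
```

Fed the bytes "28 00 04 66 2c 30 00 00 10 00", the block count comes out as 0x10 = 16, so the request size depends on the sector size passed in: 16 * 512 or 16 * 4096 bytes.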
Comment 5 Poul-Henning Kamp freebsd_committer 2016-03-29 08:53:01 UTC
My best guess is that the filesystem was created along these lines:


newfs -O2 -U -b 65536 -f 8192 -i something
Comment 6 Hans Petter Selasky freebsd_committer 2016-03-29 09:11:17 UTC
Does the system have more than 4 GB of RAM, so that bounce pages will be used for BUSDMA?
Comment 7 Poul-Henning Kamp freebsd_committer 2016-03-29 09:20:39 UTC
Dmesg:

CPU: AMD Athlon(tm) II X3 455 Processor (3311.18-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x100f53  Family=0x10  Model=0x5  Stepping=3
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x837ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>
  SVM: NP,NRIP,NAsids=64
  TSC: P-state invariant
real memory  = 17179869184 (16384 MB)
avail memory = 16573935616 (15806 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <090712 APIC1033>
FreeBSD/SMP: Multiprocessor System Detected: 3 CPUs
FreeBSD/SMP: 1 package(s) x 3 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  2
ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Gpe0Block: 64/32 (20150818/tbfadt-649)
ioapic0 <Version 2.1> irqs 0-23 on motherboard
ioapic1 <Version 2.1> irqs 24-55 on motherboard
Comment 8 Hans Petter Selasky freebsd_committer 2016-03-29 09:30:56 UTC
Hi,

This issue is not USB related.

My best guess is to try to revert r292255 and see if it makes any difference:

https://svnweb.freebsd.org/changeset/base/292255

--HPS
Comment 9 Poul-Henning Kamp freebsd_committer 2016-03-29 11:09:04 UTC
I can probably try that over the weekend.
Comment 10 Hans Petter Selasky freebsd_committer 2016-03-29 11:30:36 UTC
I'm able to reproduce the problem using the following kernel patch and test-program on the latest and greatest 11-current:

diff --git a/sys/x86/x86/busdma_machdep.c b/sys/x86/x86/busdma_machdep.c
index c96b26d..a4c0af7 100644
--- a/sys/x86/x86/busdma_machdep.c
+++ b/sys/x86/x86/busdma_machdep.c
@@ -97,6 +97,13 @@ bus_dma_run_filter(struct bus_dma_tag_common *tc, bus_addr_t paddr)
 {
        int retval;
 
+       if (tc->flags & BUS_DMA_KEEP_PG_OFFSET) {
+               if ((paddr % 3) == 0) {
+                       printf("F");
+                       return (1);
+               }
+       }
+
        retval = 0;
        do {
                if (((paddr > tc->lowaddr && paddr <= tc->highaddr) ||

cat << EOF > test.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

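int main(void)
{
	/* Raw USB disk under test */
	int fd = open("/dev/da0", O_RDONLY);
	char buffer[65536];
	int y;

	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* 64 Kbyte per read */
	y = 65536;
	while (1) {
		/* Small allocation that is never freed, kept from the original test */
		(void) malloc(512);
		/* Rewind so that every pass reads the same blocks */
		if (lseek(fd, 0, SEEK_SET) < 0)
			printf("ERROR seek\n");
		if (read(fd, buffer, y) != y)
			printf("ERROR read\n");
	}
	/* Not reached: the loop runs until the program is interrupted */
	close(fd);

	return (0);
}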
EOF


The memory stick is connected through a USB 3.0 (XHCI) port.

The problem appears to happen when bounce pages are mixed with non-bounce pages.
Comment 11 Hans Petter Selasky freebsd_committer 2016-03-29 11:55:46 UTC
I just reverted r292255 and the problem does not seem to trigger any more.

Handing issue over to royger @ freebsd . org

--HPS
Comment 12 Roger Pau Monné freebsd_committer 2016-03-30 12:24:06 UTC
Created attachment 168783
Initial patch

Hello,

Could you please test the attached patch? It should solve the issue, although I would like to hear opinions about it.
Comment 13 Hans Petter Selasky freebsd_committer 2016-03-30 12:33:57 UTC
Hi,

I'll give the patch a spin later today.

--HPS
Comment 14 Hans Petter Selasky freebsd_committer 2016-03-30 12:41:21 UTC
Comments for patch:

I think the "goto again" is not needed. Bouncing from the first bounce page onward should be sufficient?

--HPS
Comment 15 Hans Petter Selasky freebsd_committer 2016-03-30 12:47:04 UTC
More comments:

And you only need to set the bounce boolean if "vaddr & (PAGE_SIZE-1)" != 0, right? Because then you are shifting the data with regard to the page offset.

--HPS
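
A minimal sketch of the page-offset point above, assuming 4096-byte pages. All names are hypothetical and the code is not taken from busdma or the USB stack; it only shows what "preserving the page offset" means when a buffer is redirected to a bounce page:

```c
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_PAGE_SIZE	4096u	/* assumed page size */

/* Offset of an address within its page: addr & (PAGE_SIZE - 1). */
static uint64_t
sketch_page_offset(uint64_t addr)
{
	return (addr & (SKETCH_PAGE_SIZE - 1));
}

/* True if the mapped address kept the page offset of the original. */
static bool
sketch_offset_preserved(uint64_t orig, uint64_t mapped)
{
	return (sketch_page_offset(orig) == sketch_page_offset(mapped));
}

/* Address inside a bounce page that keeps the original page offset. */
static uint64_t
sketch_bounce_keep_offset(uint64_t bounce_page, uint64_t orig)
{
	return ((bounce_page & ~(uint64_t)(SKETCH_PAGE_SIZE - 1)) |
	    sketch_page_offset(orig));
}
```

With a page-aligned original address the offset is 0, so any page-aligned bounce page preserves it. Only a non-zero offset needs the adjustment in the last helper, which matches the condition in the comment above.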
Comment 16 Roger Pau Monné freebsd_committer 2016-03-30 14:00:13 UTC
(In reply to Hans Petter Selasky from comment #15)

Yes, there are several optimizations that can be applied in order to improve the patch. As I've noted in the comment, this is just an initial approach to make sure the problem is caused by not bouncing the whole region (and then the offsets don't match anymore).

Regarding the "again" label, I'm not really sure we can get rid of it. For the mapped case (_bus_dmamap_count_pages) it looks like we can remove it, but for the other cases it depends on whether maxsegsz is a multiple of PAGE_SIZE; otherwise we might end up with segments that don't have consecutive offsets, AFAICT.
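
One possible reading of the maxsegsz concern, sketched under the assumption of 4096-byte pages and a buffer that is simply cut into consecutive pieces of maxsegsz bytes. The function name is hypothetical and this is not how the busdma code is written; it only shows how the page offset at which each piece starts depends on maxsegsz:

```c
#include <stdint.h>

#define	SKETCH_PAGE_SIZE	4096u	/* assumed page size */

/*
 * Page offset at which piece number "index" (counting from 0) starts
 * when a buffer beginning at "start" is cut every "maxsegsz" bytes.
 */
static uint64_t
sketch_seg_start_offset(uint64_t start, uint64_t maxsegsz, unsigned index)
{
	return ((start + (uint64_t)index * maxsegsz) &
	    (SKETCH_PAGE_SIZE - 1));
}
```

When maxsegsz is a multiple of the page size (8192, say), every piece starts at the same page offset as the buffer itself. With a value such as 6000, the starting offsets drift from piece to piece.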
Comment 17 Poul-Henning Kamp freebsd_committer 2016-03-30 14:18:09 UTC
I'm sort of confused about why we would bounce in the first place?

This is amd64 and a USB3 host controller and device, why would bouncing be necessary?
Comment 18 Hans Petter Selasky freebsd_committer 2016-03-30 14:32:01 UTC
PHK: If you check with usbconfig that the disk is attached to XHCI and not EHCI, then no bouncing will happen. XHCI is 64-bit while EHCI is 32-bit. There are a few XHCI quirks for 32-bit. See xhci_pci.c

PHK: If your disk was attached to XHCI, then there might be more bugs in the segment list computations, which make them discontiguous.
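
A minimal sketch of the 32-bit versus 64-bit point, assuming the controller's DMA limit is expressed as an excluded address window (lowaddr, highaddr], which is the comparison visible in the busdma_machdep.c context quoted in comment 10. The macro and function names are hypothetical, and the real filter may consider more than this one comparison:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: highest address a 32-bit DMA engine can reach. */
#define	SKETCH_MAXADDR_32BIT	0xFFFFFFFFull

/*
 * True if a physical address falls inside the excluded window
 * (lowaddr, highaddr] and would therefore have to be bounced.
 */
static bool
sketch_needs_bounce(uint64_t paddr, uint64_t lowaddr, uint64_t highaddr)
{
	return (paddr > lowaddr && paddr <= highaddr);
}
```

On a machine with 16 GB of RAM, a page at physical address 0x100000000 (4 GB) would be bounced for a controller limited to SKETCH_MAXADDR_32BIT, while a page below 4 GB would not. For a 64-bit controller with lowaddr at the top of the address space, nothing falls inside the window.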
Comment 19 Poul-Henning Kamp freebsd_committer 2016-03-30 14:39:02 UTC
Ohh, I guess it's one of the quirked controllers.  Then it makes sense.


usbconfig says:

...
ugen0.2: <My Passport 0827 Western Digital> at usbus0, cfg=0 md=HOST spd=SUPER (5.0Gbps) pwr=ON (224mA)

That alone would say XHCI to me.

dmesg further has:

xhci0: <NEC uPD720200 USB 3.0 controller> mem 0xfe7fa000-0xfe7fbfff irq 50 at device 0.0 on pci4
xhci0: 32 bytes context size, 32-bit DMA
xhci0: Unable to map MSI-X table 
usbus0 on xhci0
...
usbus0: 5.0Gbps Super Speed USB v3.0
...
ugen0.2: <Western Digital> at usbus0
umass0: <Western Digital My Passport 0827, class 0/0, rev 3.00/10.12, addr 1> on usbus0
umass0:  SCSI over Bulk-Only; quirks = 0x8000
Comment 20 Hans Petter Selasky freebsd_committer 2016-03-31 14:50:16 UTC
With the attached patch I can no longer reproduce the fault. Maybe others subscribed here could try to reproduce with and without the patch as well.

Thank you!

--HPS
Comment 21 Poul-Henning Kamp freebsd_committer 2016-04-03 06:53:47 UTC
I updated the machine to vanilla:

11.0-CURRENT #0 r297514: Sat Apr  2 21:19:38 UTC 2016

And the problem seems to have gone away.
Comment 22 Hans Petter Selasky freebsd_committer 2016-04-03 07:40:31 UTC
PHK: I doubt the problem has gone away unless a fix has been committed.
Roger: Did you commit a fix in 11-current?

--HPS
Comment 23 Poul-Henning Kamp freebsd_committer 2016-04-03 07:47:59 UTC
As do I, but I don't see it any more.

I guess it may depend on VM fragmentation, so the box simply hasn't run long enough to cause it.
Comment 24 Poul-Henning Kamp freebsd_committer 2016-04-03 13:03:36 UTC
And now the problem is back, so it is still not fixed in trunk.

Will try patch.
Comment 25 Poul-Henning Kamp freebsd_committer 2016-04-04 10:25:20 UTC
Running with the patch and after 8 hours of various pounding things are still working.
Comment 26 Poul-Henning Kamp freebsd_committer 2016-04-06 20:26:15 UTC
I now have uptime of three days, and still no signs of trouble.

Looks like the patch solves the issue.
Comment 27 Hans Petter Selasky freebsd_committer 2016-04-10 13:52:31 UTC
*** Bug 208668 has been marked as a duplicate of this bug. ***
Comment 28 Roger Pau Monné freebsd_committer 2016-04-15 19:13:40 UTC
I've reverted the offending commit today. Could you give HEAD a try?

Thanks.
Comment 29 Graham Perrin 2016-04-16 07:23:31 UTC
Was the offending commit https://svnweb.freebsd.org/base?view=revision&revision=292255 ?

I ask because in my case the trouble began with an earlier FreeBSD-based PC-BSD, 11.0-CURRENTDEC2015 http://lists.pcbsd.org/pipermail/testing/2015-December/010270.html
Comment 30 Roger Pau Monné freebsd_committer 2016-04-16 07:44:37 UTC
(In reply to Graham Perrin from comment #29)

Yes, the offending commit in this case was r292255.

11.0-CURRENTDEC2015 is AFAICT from a source snapshot of Dec 5 or earlier, which means it doesn't contain r292255. In any case, could you give current HEAD a try?
Comment 31 Poul-Henning Kamp freebsd_committer 2016-04-19 22:14:40 UTC
I have just booted:  FreeBSD 11.0-CURRENT #1 r298283M

Will report back.
Comment 32 Poul-Henning Kamp freebsd_committer 2016-04-21 06:08:07 UTC
No problems so far.
Comment 33 Graham Perrin 2016-04-27 17:25:51 UTC
(In reply to comment #30)

> … could you give current HEAD a try?

Thanks … I'll try PC-BSD 11.0-CURRENTMAY2016, when it arrives.

(Nothing sooner, sorry; I'm dealing with bereavement.)
Comment 34 Poul-Henning Kamp freebsd_committer 2016-05-05 20:30:13 UTC
Still on r298283M, 15 days uptime, no trouble seen.
Comment 35 Graham Perrin 2016-05-18 19:28:43 UTC
I reused the PersonaCrypt device with PC-BSD 11.0-CURRENTMAY2016 on two notebooks with 4 GB memory: an Ergo Vista 621 (circa 2007) and a MacBookPro8,2 (15-inch, early 2011).

I see occasional 'CAM status: CCB request completed with an error' (examples below) but happily, with this month's CURRENT, neither machine grinds to a halt. More specifically, with reference to bug report 208668: 

* I no longer see 'Error 5, Retries exhausted'


$ date ; uptime ; freebsd-version ; uname -a
18 May 2016 at 19:20:43 UTC
 7:20p.m.  up  1:39, 1 users, load averages: 1.52, 1.88, 1.77
11.0-CURRENTMAY2016
FreeBSD cces3-gjp4-pcbsd-macbookpro82 11.0-CURRENTMAY2016 FreeBSD 11.0-CURRENTMAY2016 #19 5bab0d2(master): Fri May  6 17:56:25 UTC 2016     root@devastator:/usr/obj/usr/src/sys/GENERIC  amd64
$ sudo dmesg | grep umass
umass0: <JetFlash TS256MJF2A, class 0/0, rev 1.10/1.00, addr 6> on usbus3
umass0:  SCSI over Bulk-Only; quirks = 0xc100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
umass0: at uhub7, port 2, addr 6 (disconnected)
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
(da0:umass-sim0:0:0:0): Periph destroyed
umass0: <Kingston DataTraveler 3.0, class 0/0, rev 2.10/1.00, addr 4> on usbus1
umass0:  SCSI over Bulk-Only; quirks = 0x8100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
umass0: <Kingston DataTraveler 3.0, class 0/0, rev 2.10/1.00, addr 7> on usbus1
umass0:  SCSI over Bulk-Only; quirks = 0x8100
umass0:2:0: Attached to scbus2
da0 at umass-sim0 bus 0 scbus2 target 0 lun 0
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 f4 97 c2 00 00 07 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
(da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 02 f4 97 c2 00 00 07 00 
(da0:umass-sim0:0:0:0): CAM status: CCB request completed with an error
(da0:umass-sim0:0:0:0): Retrying command
$ 


(In reply to Graham Perrin from comment #29)

> …  in my case the trouble began with an 
> earlier FreeBSD-based PC-BSD, 11.0-CURRENTDEC2015 
> http://lists.pcbsd.org/pipermail/testing/2015-December/010270.html

That remains a puzzle but in my opinion, not enough of a puzzle for this bug report 208365 to remain open. 

Thanks
Comment 36 Mark Linimon freebsd_committer freebsd_triage 2017-12-01 00:11:55 UTC
It appears the problem was resolved around 2016-05-18.