Bug 231064 - vgapci device memory not excluded on ThunderX
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: arm64 Any
Importance: --- Affects Only Me
Assignee: Mark Johnston
URL:
Keywords:
Depends on:
Blocks: 228911
Reported: 2018-08-31 19:45 UTC by Mark Johnston
Modified: 2018-09-15 18:47 UTC

Description Mark Johnston freebsd_committer 2018-08-31 19:45:41 UTC
I'm testing -ALPHA3 on a packet.net ThunderX.  When I boot GENERIC-NODEBUG, the kernel panics right about the time it gets to the login prompt:

(kgdb) bt
#0  doadump (textdump=0) at /usr/src/sys/kern/kern_shutdown.c:366
#1  0xffff00000018f520 in db_dump (dummy=-281474967580032, dummy2=false, dummy3=-1, dummy4=0xffff00014d3cdb4c "") at /usr/src/sys/ddb/db_command.c:574
#2  0xffff00000018f298 in db_command (last_cmdp=0xffff000001018258 <db_last_command>, cmd_table=0x0, dopager=1) at /usr/src/sys/ddb/db_command.c:481
#3  0xffff00000018edc8 in db_command_loop () at /usr/src/sys/ddb/db_command.c:534
#4  0xffff0000001951e0 in db_trap (type=37, code=0) at /usr/src/sys/ddb/db_main.c:252
#5  0xffff0000007050c0 in kdb_trap (type=37, code=0, tf=0xffff00014d3ce1e0) at /usr/src/sys/kern/subr_kdb.c:693
#6  0xffff000000c8bec8 in data_abort (td=0xfffffd006112f000, frame=0xffff00014d3ce1e0, esr=2516582404, far=16777259, lower=0)
    at /usr/src/sys/arm64/arm64/trap.c:261
#7  0xffff000000c8b858 in do_el1h_sync (td=0xfffffd006112f000, frame=0xffff00014d3ce1e0) at /usr/src/sys/arm64/arm64/trap.c:341
#8  <signal handler called>
#9  0xffff0000008b5280 in in_pcbremlbgrouphash (inp=0xfffffd00e975a9b0) at /usr/src/sys/netinet/in_pcb.c:414
#10 0xffff0000008b504c in in_pcbdrop (inp=0xfffffd00e975a9b0) at /usr/src/sys/netinet/in_pcb.c:1687
#11 0xffff0000009d4eb4 in tcp_close (tp=0xfffffd00e975d3d0) at /usr/src/sys/netinet/tcp_subr.c:1991
#12 0xffff0000009c13c0 in tcp_do_segment (m=0xfffffd0049dfe100, th=0xfffffd0049e6b0a8, so=0xfffffd007bbfd000, tp=0xfffffd00e975d3d0, drop_hdrlen=52, 
    tlen=31, iptos=0 '\000') at /usr/src/sys/netinet/tcp_input.c:2306
#13 0xffff0000009be02c in tcp_input (mp=0xffff00014d3ceff8, offp=0xffff00014d3cefd0, proto=6) at /usr/src/sys/netinet/tcp_input.c:1392
#14 0xffff0000008c203c in ip_input (m=0x0) at /usr/src/sys/netinet/ip_input.c:827
#15 0xffff000000877330 in netisr_dispatch_src (proto=1, source=0, m=0xfffffd0049dfe100) at /usr/src/sys/net/netisr.c:1122
#16 0xffff000000877ac4 in netisr_dispatch (proto=1, m=0xfffffd0049dfe100) at /usr/src/sys/net/netisr.c:1213
#17 0xffff0000008468a0 in ether_demux (ifp=0xfffffd0049a02000, m=0xfffffd0049dfe100) at /usr/src/sys/net/if_ethersubr.c:874
#18 0xffff000000848fbc in ether_input_internal (ifp=0xfffffd0049a02000, m=0xfffffd0049dfe100) at /usr/src/sys/net/if_ethersubr.c:662
#19 0xffff0000008487e0 in ether_nh_input (m=0xfffffd0049dfe100) at /usr/src/sys/net/if_ethersubr.c:692
#20 0xffff000000877330 in netisr_dispatch_src (proto=5, source=0, m=0xfffffd0049dfe100) at /usr/src/sys/net/netisr.c:1122
#21 0xffff000000877ac4 in netisr_dispatch (proto=5, m=0xfffffd0049dfe100) at /usr/src/sys/net/netisr.c:1213
#22 0xffff000000847100 in ether_input (ifp=0xfffffd00498e4800, m=0xfffffd0049dfe100) at /usr/src/sys/net/if_ethersubr.c:782
#23 0xffff0000009c5d6c in tcp_lro_flush (lc=0xffff000149546788, le=0xfffffd000ae25bf0) at /usr/src/sys/netinet/tcp_lro.c:397
#24 0xffff0000009c6c78 in tcp_lro_rx2 (lc=0xffff000149546788, m=0xfffffd0049dfe000, csum=56586, use_hash=1) at /usr/src/sys/netinet/tcp_lro.c:785
#25 0xffff0000009c7414 in tcp_lro_rx (lc=0xffff000149546788, m=0xfffffd0049dfe000, csum=0) at /usr/src/sys/netinet/tcp_lro.c:952
#26 0xffff000000ce1b80 in nicvf_rcv_pkt_handler (nic=0xfffffd00330d1000, cq=0xffff000149547480, cqe_rx=0xffff00016f402800, cqe_type=2)
    at /usr/src/sys/dev/vnic/nicvf_queues.c:678
#27 0xffff000000ce181c in nicvf_cq_intr_handler (nic=0xfffffd00330d1000, cq_idx=4 '\004') at /usr/src/sys/dev/vnic/nicvf_queues.c:774
#28 0xffff000000ce1424 in nicvf_cmp_task (arg=0xffff000149547480, pending=1) at /usr/src/sys/dev/vnic/nicvf_queues.c:887
#29 0xffff00000072817c in taskqueue_run_locked (queue=0xfffffd004b261800) at /usr/src/sys/kern/subr_taskqueue.c:465
#30 0xffff00000072a304 in taskqueue_thread_loop (arg=0xffff000149547500) at /usr/src/sys/kern/subr_taskqueue.c:757
#31 0xffff00000061d680 in fork_exit (callout=0xffff00000072a1a4 <taskqueue_thread_loop>, arg=0xffff000149547500, frame=0xffff00014d3cf960)
    at /usr/src/sys/kern/kern_fork.c:1057
#32 <signal handler called>

Interestingly, the panic does not occur under GENERIC.  It does occur if I recompile GENERIC-NODEBUG with -O0, so I'm able to get a usable kernel dump.  Clearly "grp" is a bogus pointer, but it's not clear where it comes from:

(kgdb) frame 9
#9  0xffff0000008b5280 in in_pcbremlbgrouphash (inp=0xfffffd00e975a9b0) at /usr/src/sys/netinet/in_pcb.c:414                                                 
414                     for (i = 0; i < grp->il_inpcnt; ++i) {
(kgdb) info local
pcbinfo = 0xffff0000e9851820
hdr = 0xffff000148a3bbb0
grp = 0xffffff
i = 0
(kgdb) p *hdr
$1 = {lh_first = 0x0}
Comment 1 Mark Johnston freebsd_committer 2018-09-01 22:14:08 UTC
It looks like the lbgroup hash table is getting corrupted; many of the list heads are equal to 0xffffff00ffffff or 0xffffff.  Nothing on the system actually uses SO_REUSEPORT_LB, so we shouldn't be inserting any hash table entries.  I tried making the hash table read-only using pmap_protect(), but that doesn't catch the problem: the system still panics the same way.  This, plus the fact that the bug is apparently sensitive to memory layout (it goes away under GENERIC or when increasing KSTACK_PAGES), makes it seem like this isn't a generic kernel bug.

That said, there are some bugs in the SO_REUSEPORT_LB implementation:
- Lookups are protected with epoch, but the hash table doesn't use CK_ lists and we don't defer frees of the hash table entries.
- in_pcblbgroup_free() uses the wrong malloc type.
- Lots of style bugs.
Comment 2 Mark Johnston freebsd_committer 2018-09-05 17:59:03 UTC
I discovered that the hash table is "corrupted" immediately after it is allocated and initialized.  In my case, the table is allocated starting at physical address 0x10000000 and is physically contiguous.  It appears that this collides with an address range used by the vgapci device:

      pcib9
          Device Memory:
              0x87e0c0000000-0x87e0c0ffffff
        pci9
            PCI domain 0 bus numbers:
                32
          pcib10
              PCI domain 0 bus numbers:
                  33
              PEM PCIe Memory:
                  0x10000000-0x110fffff
              PEM PCIe IO:
                  0x0-0xfff
            pci10
                pcib10 bus numbers:
                    33
              vgapci0
                  pcib10 memory window:
                      0x10000000-0x10ffffff
                      0x11000000-0x1101ffff

Indeed, this range isn't excluded from the EFI map; it falls inside the ConventionalMemory entry that starts at 0xd00000:

                   Type     Physical      Virtual   #Pages Attr                                                                                                               
    RuntimeServicesData 000000500000       500000 00000800 UC WC WT WB RUNTIME                                                                                                
     ConventionalMemory 000000d00000            0 000ff2ec UC WC WT WB                                                                                                        
    RuntimeServicesData 0000fffec000     fffec000 00000014 UC WC WT WB RUNTIME                                                                                                
     ConventionalMemory 000100000000            0 00ef0100 UC WC WT WB                                                                                                        
       BootServicesData 000ff0100000            0 00000020 UC WC WT WB                                                                                                        
     ConventionalMemory 000ff0120000            0 0000eacc UC WC WT WB                                                                                                        
       BootServicesData 000ffebec000            0 00000514 UC WC WT WB                                                                                                        
     ConventionalMemory 010000400000            0 00fea102 UC WC WT WB                                                                                                        
             LoaderData 010fea502000            0 00008001 UC WC WT WB                                                                                                        
             LoaderCode 010ff2503000            0 00000086 UC WC WT WB
             LoaderData 010ff2589000            0 0000218b UC WC WT WB
             LoaderCode 010ff4714000            0 00000015 UC WC WT WB                                                                                                        
       BootServicesData 010ff4729000            0 0000965d UC WC WT WB
     ConventionalMemory 010ffdd86000            0 000001de UC WC WT WB
       BootServicesCode 010ffdf64000            0 00000779 UC WC WT WB                                                                                                        
     ConventionalMemory 010ffe6dd000            0 00000040 UC WC WT WB
      ACPIReclaimMemory 010ffe71d000            0 0000000b UC WC WT WB
          ACPIMemoryNVS 010ffe728000            0 00000060 UC WC WT WB                                                                                                        
    RuntimeServicesData 010ffe788000  10ffe788000 00000c30 UC WC WT WB RUNTIME
    RuntimeServicesCode 010fff3b8000  10fff3b8000 00000c47 UC WC WT WB RUNTIME
       BootServicesData 010ffffff000            0 00000001 UC WC WT WB                                                                                                        
         MemoryMappedIO 803000000000 803000000000 00001000 UC RUNTIME
         MemoryMappedIO 804000001000 804000001000 00002000 UC RUNTIME
         MemoryMappedIO 87e006001000 87e006001000 00001000 UC RUNTIME                                                                                                         
         MemoryMappedIO 87e024000000 87e024000000 00001000 UC RUNTIME                                                                                                         
         MemoryMappedIO 87e0d0001000 87e0d0001000 00000001 UC RUNTIME                                                                                                         
         MemoryMappedIO 903000000000 903000000000 00001000 UC RUNTIME                                                                                                         
         MemoryMappedIO 904000001000 904000001000 00002000 UC RUNTIME                                                                                                         
         MemoryMappedIO 97e006001000 97e006001000 00001000 UC RUNTIME                                                                                                         
Physical memory chunk(s):                                                                                                                                                     
  0x00500000 - 0xfff0fffff, 65516 MB (16772096 pages)                                                                                                                         
  0x10000400000 - 0x10ffe71cfff, 65507 MB (16769821 pages)                                                                                                                    
  0x10ffe788000 - 0x10fff3b7fff,    12 MB (   3120 pages)                                                                                                                     
  0x10ffffff000 - 0x10fffffffff,     0 MB (      1 pages)                                                                                                                     
Excluded memory regions:                                                                                                                                                      
  0x00500000 - 0x00cfffff,     8 MB (   2048 pages) NoAlloc                                                                                                                   
  0xfffec000 - 0xffffffff,     0 MB (     20 pages) NoAlloc                                                                                                                   
  0x10fea600000 - 0x10fec38cfff,    29 MB (   7565 pages) NoAlloc                                                                                                             
  0x10ffe71d000 - 0x10fffffefff,    24 MB (   6370 pages) NoAlloc                                                                                                             
  0x803000000000 - 0x803000ffffff,    16 MB (   4096 pages) NoAlloc                                                                                                           
  0x804000001000 - 0x804002000fff,    32 MB (   8192 pages) NoAlloc                                                                                                           
  0x87e006001000 - 0x87e007000fff,    16 MB (   4096 pages) NoAlloc                                                                                                           
  0x87e024000000 - 0x87e024ffffff,    16 MB (   4096 pages) NoAlloc                                                                                                           
  0x87e0d0001000 - 0x87e0d0001fff,     0 MB (      1 pages) NoAlloc                                                                                                           
  0x903000000000 - 0x903000ffffff,    16 MB (   4096 pages) NoAlloc
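
The collision can be checked mechanically from the numbers in the two dumps. The following is only a userland sketch, not kernel code: the struct, the tables, and the helper names are made up for illustration, and the ranges are copied by hand from the "Physical memory chunk(s)" and "Excluded memory regions" output. It reports whether a device window lands in memory the kernel considers RAM without being touched by any exclusion.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive physical address range, as printed in the boot messages. */
struct range {
    uint64_t start;
    uint64_t end;
};

#define NELEM(a) (sizeof(a) / sizeof((a)[0]))

/* "Physical memory chunk(s)" copied from the dump. */
static const struct range phys_chunks[] = {
    { 0x00500000ULL,     0xfff0fffffULL },
    { 0x10000400000ULL,  0x10ffe71cfffULL },
    { 0x10ffe788000ULL,  0x10fff3b7fffULL },
    { 0x10ffffff000ULL,  0x10fffffffffULL },
};

/* "Excluded memory regions" copied from the dump. */
static const struct range excluded[] = {
    { 0x00500000ULL,      0x00cfffffULL },
    { 0xfffec000ULL,      0xffffffffULL },
    { 0x10fea600000ULL,   0x10fec38cfffULL },
    { 0x10ffe71d000ULL,   0x10fffffefffULL },
    { 0x803000000000ULL,  0x803000ffffffULL },
    { 0x804000001000ULL,  0x804002000fffULL },
    { 0x87e006001000ULL,  0x87e007000fffULL },
    { 0x87e024000000ULL,  0x87e024ffffffULL },
    { 0x87e0d0001000ULL,  0x87e0d0001fffULL },
    { 0x903000000000ULL,  0x903000ffffffULL },
};

/* Two inclusive ranges intersect iff each starts before the other ends. */
static bool
overlaps(struct range a, struct range b)
{
    return (a.start <= b.end && b.start <= a.end);
}

static bool
overlaps_any(struct range r, const struct range *tbl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (overlaps(r, tbl[i]))
            return (true);
    }
    return (false);
}

/*
 * Coarse check: true if the window intersects a physical memory chunk
 * and no excluded region touches it, i.e. the allocator may hand out
 * pages that the device also decodes.
 */
static bool
window_is_allocatable(struct range r)
{
    return (overlaps_any(r, phys_chunks, NELEM(phys_chunks)) &&
        !overlaps_any(r, excluded, NELEM(excluded)));
}
```

With these tables, the vgapci0 window 0x10000000-0x1101ffff comes back as allocatable, while the pcib9 device memory at 0x87e0c0000000-0x87e0c0ffffff does not, since it lies outside every physical memory chunk.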
Comment 3 Andrew Turner freebsd_committer 2018-09-06 12:46:03 UTC
You can try excluding that physical memory from the map by calling arm_physmem_exclude_region from initarm. It takes the start address, length, and the EXFLAG_NOALLOC flag, e.g. arm_physmem_exclude_region(0x10000000, 0x01020000, EXFLAG_NOALLOC).

If you put it just after pmap_bootstrap we can still access the memory via the DMAP to see what data is being put into the range.
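
For reference, the length in that call is just the span of the two vgapci0 windows taken together, and the effect of the exclusion is to split the first physical memory chunk around the device range. A rough userland model of both, assuming half-open ranges and made-up helper names (`window_length` and `carve` are not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

/* Half-open physical address range [start, end). */
struct span {
    uint64_t start;
    uint64_t end;
};

/* Length of a window given its first and last (inclusive) byte address. */
static uint64_t
window_length(uint64_t first, uint64_t last)
{
    assert(first <= last);
    return (last - first + 1);
}

/*
 * Remove ex from chunk and store the surviving pieces in out[].
 * Returns how many pieces (0, 1 or 2) were written.
 */
static int
carve(struct span chunk, struct span ex, struct span out[2])
{
    int n = 0;

    assert(chunk.start < chunk.end && ex.start < ex.end);

    /* No intersection: the chunk survives unchanged. */
    if (ex.end <= chunk.start || ex.start >= chunk.end) {
        out[n++] = chunk;
        return (n);
    }
    /* Piece below the excluded range, if any. */
    if (ex.start > chunk.start)
        out[n++] = (struct span){ chunk.start, ex.start };
    /* Piece above the excluded range, if any. */
    if (ex.end < chunk.end)
        out[n++] = (struct span){ ex.end, chunk.end };
    return (n);
}
```

`window_length(0x10000000, 0x1101ffff)` gives 0x01020000, matching the suggested argument. Carving [0x10000000, 0x11020000) out of the first chunk [0x00500000, 0xfff100000) leaves [0x00500000, 0x10000000) and [0x11020000, 0xfff100000), so nothing in the device window stays allocatable.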
Comment 4 Mark Johnston freebsd_committer 2018-09-06 17:35:57 UTC
(In reply to Andrew Turner from comment #3)
Indeed, that works around the panic.  (I actually had to bump MAX_EXCNT for this to work.)  I'm not sure how best to handle this for 12.0 though.
Comment 5 Mark Johnston freebsd_committer 2018-09-06 18:55:42 UTC
The issue appears to be simply that the EFI framebuffer is included as ConventionalMemory in the EFI map.  It is excluded from the map on the x86 EFI systems that I can easily check.  Anyway, this seems easy enough to work around.
Comment 6 Mark Johnston freebsd_committer 2018-09-07 19:46:46 UTC
https://reviews.freebsd.org/D17073
Comment 7 commit-hook freebsd_committer 2018-09-08 21:52:16 UTC
A commit references this bug:

Author: markj
Date: Sat Sep  8 21:51:47 UTC 2018
New revision: 338537
URL: https://svnweb.freebsd.org/changeset/base/338537

Log:
  Bump MAX_HWCNT and MAX_EXCNT.

  These limits are hit on the ThunderX.  Also make
  arm_physmem_exclude_region() panic rather than fail silently if the
  limit on excluded regions is reached.

  PR:		231064
  Reviewed by:	andrew
  Approved by:	re (kib)
  MFC after:	1 week
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D17073

Changes:
  head/sys/arm/arm/physmem.c
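
The behavioural change described in that log (panic instead of silently dropping a region once the table is full) can be illustrated with a small userland model. This is a sketch under assumptions, not the code in physmem.c: the capacity, the struct layout, the message text and the `model_` names are all invented here, and `abort()` stands in for a kernel panic.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MODEL_MAX_EXCNT 4    /* hypothetical capacity, not the kernel's value */

struct model_region {
    uint64_t addr;
    uint64_t size;
    uint32_t flags;
};

static struct model_region model_excl[MODEL_MAX_EXCNT];
static size_t model_excnt;

/*
 * Record an excluded region.  A full table is treated as fatal, so a
 * missing exclusion cannot go unnoticed and later surface as memory
 * corruption far from the cause.
 */
static void
model_exclude_region(uint64_t addr, uint64_t size, uint32_t flags)
{
    if (model_excnt == MODEL_MAX_EXCNT) {
        fprintf(stderr, "model panic: no room to exclude %#" PRIx64
            "-%#" PRIx64 "\n", addr, addr + size - 1);
        abort();
    }
    model_excl[model_excnt].addr = addr;
    model_excl[model_excnt].size = size;
    model_excl[model_excnt].flags = flags;
    model_excnt++;
}
```

In this model the first four calls are recorded and a fifth aborts with a message. The earlier need to bump MAX_EXCNT by hand (comment #4) suggests why failing loudly is the safer behaviour.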
Comment 8 commit-hook freebsd_committer 2018-09-08 21:53:19 UTC
A commit references this bug:

Author: markj
Date: Sat Sep  8 21:52:45 UTC 2018
New revision: 338538
URL: https://svnweb.freebsd.org/changeset/base/338538

Log:
  Exclude the EFI framebuffer from phys_avail[] on arm64.

  On the ThunderX the region occupied by the framebuffer is included in
  the EFI map, so explicitly add it to the set of regions that aren't
  managed by the physical memory allocator.

  PR:		231064
  Reviewed by:	andrew
  Approved by:	re (gjb)
  MFC after:	1 week
  Sponsored by:	The FreeBSD Foundation
  Differential Revision:	https://reviews.freebsd.org/D17073

Changes:
  head/sys/arm64/arm64/machdep.c
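
The shape of that change, as described in the log, is: look up the framebuffer the loader reported and, if there is one, add its range to the excluded set. The sketch below models only that control flow in userland. The struct, its field names, the flag value and the recording stub are all assumptions made for illustration; the real code lives in machdep.c and calls arm_physmem_exclude_region().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the framebuffer metadata passed by the loader. */
struct model_fb {
    uint64_t fb_addr;    /* physical base of the framebuffer */
    uint64_t fb_size;    /* size in bytes */
};

#define MODEL_NOALLOC 0x01    /* placeholder for EXFLAG_NOALLOC; value assumed */

/* State recorded by the stub so the effect of a call can be inspected. */
static uint64_t stub_last_addr;
static uint64_t stub_last_size;
static uint32_t stub_last_flags;
static int stub_calls;

/* Stub with the same argument order as the call suggested in comment #3. */
static void
stub_exclude_region(uint64_t addr, uint64_t size, uint32_t flags)
{
    stub_last_addr = addr;
    stub_last_size = size;
    stub_last_flags = flags;
    stub_calls++;
}

/*
 * Exclude the framebuffer, if the loader reported one.  Systems booted
 * without a framebuffer pass NULL and are left untouched.
 */
static void
model_exclude_framebuffer(const struct model_fb *fb)
{
    if (fb == NULL)
        return;
    stub_exclude_region(fb->fb_addr, fb->fb_size, MODEL_NOALLOC);
}
```

Taking the range from the loader's metadata rather than hard-coding 0x10000000 keeps the workaround independent of where a given firmware places the framebuffer.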
Comment 9 commit-hook freebsd_committer 2018-09-15 18:02:59 UTC
A commit references this bug:

Author: markj
Date: Sat Sep 15 18:02:28 UTC 2018
New revision: 338695
URL: https://svnweb.freebsd.org/changeset/base/338695

Log:
  MFC r338538:
  Exclude the EFI framebuffer from phys_avail[] on arm64.

  PR:	231064

Changes:
_U  stable/11/
  stable/11/sys/arm64/arm64/machdep.c
Comment 10 commit-hook freebsd_committer 2018-09-15 18:47:43 UTC
A commit references this bug:

Author: markj
Date: Sat Sep 15 18:47:08 UTC 2018
New revision: 338696
URL: https://svnweb.freebsd.org/changeset/base/338696

Log:
  Revert r338695: it depends on r334032, which was not MFCed.

  PR:	231064

Changes:
_U  stable/11/
  stable/11/sys/arm64/arm64/machdep.c