[This may be related to bugzilla 224080 that is also for -r326347 problems.] See Emmanuel Vadot's pine64 USB-breakage report at: https://lists.freebsd.org/pipermail/svn-src-head/2017-December/106954.html He writes in part: This seems to break USB (EHCI and OHCI) on Pine64 (arm64) The board just hang when plugin usb device In my case I saw this as a boot problem because I use a USB SSD on a powered hub for the root file system. (And I did not know about the above report yet.) See my list messages for the rpi2 example: https://lists.freebsd.org/pipermail/freebsd-current/2017-December/067771.html https://lists.freebsd.org/pipermail/freebsd-current/2017-December/067799.html https://lists.freebsd.org/pipermail/freebsd-current/2017-December/067807.html https://lists.freebsd.org/pipermail/freebsd-current/2017-December/067811.html My reports have some material from db> prompt commands (but I'm no expert at what to pick to show for this). In my case I did a bisect sequence to find the -r326346-and-before good / -r325347-and-later bad boundary: I had jumped from -r326192 to -r326726 but had the problem. Some db> material is from -r326726 before the bisect and other material is from -r326347 . The kernel thread (100001) involves: . . . uma_zcreate() at uma_zcreate+0x10c pc = 0xc052b024 lr = 0xc050cc90 (ffs_mount+0x80) sp = 0xd42a5858 fp = 0xd42a5980 r4 = 0xd7844000 r5 = 0x00000000 r6 = 0x00000003 r7 = 0xc09a99c0 r8 = 0x00000000 r9 = 0xd429f000 r10 = 0xd828c400 ffs_mount() at ffs_mount+0x80 pc = 0xc050cc90 lr = 0xc032e1d4 (vfs_donmount+0xeec) sp = 0xd42a5988 fp = 0xd42a5b38 r4 = 0xffffffff r5 = 0xd74ec120 r6 = 0xd429f000 r7 = 0x00000000 r8 = 0x00000000 r9 = 0xc425acd0 r10 = 0xd828c400 . . . The uma thread involves: . . . uma_reclaim_locked() at uma_reclaim_locked+0x200 pc = 0xc052d9f8 lr = 0xc052de70 (uma_reclaim_worker+0x4c) sp = 0xd87dadf0 fp = 0xd87dae20 r4 = 0xc09a9c68 r5 = 0xc4242d80 r6 = 0xc06df86e r7 = 0x00000000 r8 = 0x00000100 r9 = 0xc09b3d08 r10 = 0xc09a9c7c uma_reclaim_worker() at uma_reclaim_worker+0x4c pc = 0xc052de70 lr = 0xc0230f10 (fork_exit+0xa0) sp = 0xd87dae28 fp = 0xd87dae40 r4 = 0xd7ad8740 r5 = 0xd70ce000 r6 = 0xc052de24 r7 = 0x00000000 r8 = 0xd87dae48 r9 = 0xd7ada3a0 r10 = 0xc0824e8c . . .
I currently use zfs on USB <-> IDE and USB <-> SATA as well (on amd64). No problems there however I also have UFS on USB <-> SATA. I'll check that out tonight.
(In reply to Mark Millard from comment #0) I should have noted: The rpi2 is a v1.1 one, a Cortex-A7, armv7 context. Pine64 has a Cortex-A53, aarch64 context. (rpi2 v1.2 would be Cortext-A53, thus the need for the note even for folks familiar with the boards.)
Over to committer of r326347 for review.
Can someone who is experiencing a hang give me the following values from ddb: uma_kmem_limit uma_kmem_total uma_recliam_needed These are also indirectly visible via: sysctl vm.kmem_map_free sysctl vm.kmem_map_size I have not noticed anything irregular in the arm usage of kmem after a review. So we'll need to debug a bit further. Thank you for your patience.
mi_switch() at mi_switch+0x258 pc = 0xc0275f28 lr = 0xc052d9f8 (uma_reclaim_locked+0x200) Can someone on arm report what source line is at this offset? You can use addr2line or gdb list *(uma_reclaim_locked+0x200) also show vmem kernel_arena
(In reply to Jeff Roberson from comment #5) This reply is for things I can do without the rpi2 at the moment. . . addr2line (on the cross build host since lack of booting the rpi2 is part of my context): I show both the pc and the lr addresses in: pc = 0xc0275f28 lr = 0xc052d9f8 (uma_reclaim_locked+0x200) # addr2line -e /usr/obj/DESTDIRs/clang-armv7dbg-installkernel_bisect/usr/lib/debug/boot/kernel/kernel.debug 0xc0275f28 /usr/src_bisect/sys/kern/kern_synch.c:447 So: head/sys/kern/kern_synch.c 326023 in mi_switch on the if line below: /* * If the last thread was exiting, finish cleaning it up. */ if ((td = PCPU_GET(deadthread))) { PCPU_SET(deadthread, NULL); thread_stash(td); } # addr2line -e /usr/obj/DESTDIRs/clang-armv7dbg-installkernel_bisect/usr/lib/debug/boot/kernel/kernel.debug 0xc052d9f8 ./machine/pcpu.h:96 So interpreting that as /usr/src_bisect/sys/arm/include/pcpu.h in the source tree used for the build: head/sys/arm/include/pcpu.h 326258 in the __asm line in: static inline struct thread * get_curthread(void) { void *ret; __asm __volatile("mrc p15, 0, %0, c13, c0, 4" : "=r" (ret)); return (ret); } When I'm using the rpi2 at a db> prompt I'll try to confirm that the mrc is what I see.
Can you gradually reduce the offset until it gives a location in actual uma code?
(In reply to Jeff Roberson from comment #5) [I'll see in a bit about finding a uma code address.] show vmem kernel_arena crashes after a few lines: db> show vmem kernel_arena vmem 0xc07eb71c ',' quantum: 1 size: 0 inuse: 3230964176 free: 1064003120 busy tags: -1072951294 free tags: 2147482985 KDB: reentering KDB: stack backtrace: db_trace_self() at db_trace_self pc = 0xc05622b8 lr = 0xc005f0d4 (db_trace_self_wrapper+0x30) sp = 0xd42b2278 fp = 0xd42b2390 db_trace_self_wrapper() at db_trace_self_wrapper+0x30 pc = 0xc005f0d4 lr = 0xc02b2e1c (kdb_reenter+0x44) sp = 0xd42b2398 fp = 0xd42b23a0 r4 = 0xc0943e28 r5 = 0x5155005d r6 = 0x00000005 r7 = 0x00000005 kdb_reenter() at kdb_reenter+0x44 pc = 0xc02b2e1c lr = 0xc0585d4c (abort_handler+0x110) sp = 0xd42b23a8 fp = 0xd42b2438 r4 = 0x00000005 r10 = 0xd42b2440 abort_handler() at abort_handler+0x110 pc = 0xc0585d4c lr = 0xc0564cfc (exception_exit) sp = 0xd42b2440 fp = 0xd42b2898 r4 = 0xc07eb71c r5 = 0xd42b2790 r6 = 0x51550049 r7 = 0xd42b25c0 r8 = 0xd42b26a8 r9 = 0xd42b24d8 r10 = 0x00000000 exception_exit() at exception_exit pc = 0xc0564cfc lr = 0xc02cfe50 (vmem_summ+0xd8) sp = 0xd42b24d0 fp = 0xd42b2898 r0 = 0xc07eb840 r1 = 0xd42b25c0 r2 = 0xd42b2790 r3 = 0x51550049 r4 = 0xc07eb71c r5 = 0xd42b2790 r6 = 0x51550049 r7 = 0xd42b25c0 r8 = 0xd42b26a8 r9 = 0xd42b24d8 r10 = 0x00000000 r12 = 0xd42b25c0 vmem_summ() at vmem_summ+0xf8 pc = 0xc02cfe70 lr = 0xc005bb34 (db_command+0x268) sp = 0xd42b28a0 fp = 0xd42b2940 r4 = 0x00000001 r5 = 0x00000001 r6 = 0xc069da80 r7 = 0xc02cfd78 r8 = 0xc09b02b8 r9 = 0xc07eb7ec r10 = 0x00000000 db_command() at db_command+0x268 pc = 0xc005bb34 lr = 0xc005b8bc (db_command_loop+0x74) sp = 0xd42b2948 fp = 0xd42b2958 r4 = 0xc067b689 r5 = 0xc06ad511 r6 = 0xc09b02a4 r7 = 0xd42b2b28 r8 = 0xc0824ea0 r9 = 0xc07b3ee8 r10 = 0xc0943e10 db_command_loop() at db_command_loop+0x74 pc = 0xc005b8bc lr = 0xc005f25c (db_trap+0x12c) sp = 0xd42b2960 fp = 0xd42b2a78 r4 = 0x00000000 r5 = 0xc09b02b0 r6 = 0xc0943e30 r10 = 0xc0943e10 . . .
Can you get me the uma_kmem_* variables? I think you actually need to do: x/x kernel_arena to extract the address to pass to show vmem.
(In reply to Jeff Roberson from comment #4) db> print *uma_kmem_limit 10bde000 db> print *uma_kmem_total e05000 db> print *uma_reclaim_needed 7
show page ?
(In reply to Jeff Roberson from comment #9) As for kernel_arena . . . db> x/x kernel_arena kernel_arena: c0948e40 db> show vmem 0xc0948e40 vmem 0xc0948e40 'kernel arena' quantum: 4096 size: 362807296 inuse: 362754048 free: 53248 busy tags: 1966 free tags: 3 inuse size free size 4096 1802 7380992 0 0 8192 14 114688 0 0 12288 85 1044480 0 0 16384 15 245760 0 0 20480 2 40960 0 0 32768 5 163840 0 0 36864 1 36864 0 0 53248 0 0 1 53248 65536 11 720896 0 0 94208 1 94208 0 0 131072 8 1134592 0 0 262144 13 4153344 0 0 524288 1 888832 0 0 2097152 2 6844416 0 0 4194304 2 12742656 0 0 8388608 1 10035200 0 0 16777216 2 52510720 0 0 134217728 1 264601600 0 0 But it turns out that *kernel_arena works: db> show vmem *kernel_arena vmem 0xc0948e40 'kernel arena' quantum: 4096 size: 362807296 inuse: 362754048 free: 53248 busy tags: 1966 free tags: 3 inuse size free size 4096 1802 7380992 0 0 8192 14 114688 0 0 12288 85 1044480 0 0 16384 15 245760 0 0 20480 2 40960 0 0 32768 5 163840 0 0 36864 1 36864 0 0 53248 0 0 1 53248 65536 11 720896 0 0 94208 1 94208 0 0 131072 8 1134592 0 0 262144 13 4153344 0 0 524288 1 888832 0 0 2097152 2 6844416 0 0 4194304 2 12742656 0 0 8388608 1 10035200 0 0 16777216 2 52510720 0 0 134217728 1 264601600 0 0 db> show vmem *kernel_arena vmem 0xc0948e40 'kernel arena' quantum: 4096 size: 362807296 inuse: 362754048 free: 53248 busy tags: 1966 free tags: 3 inuse size free size 4096 1802 7380992 0 0 8192 14 114688 0 0 12288 85 1044480 0 0 16384 15 245760 0 0 20480 2 40960 0 0 32768 5 163840 0 0 36864 1 36864 0 0 53248 0 0 1 53248 65536 11 720896 0 0 94208 1 94208 0 0 131072 8 1134592 0 0 262144 13 4153344 0 0 524288 1 888832 0 0 2097152 2 6844416 0 0 4194304 2 12742656 0 0 8388608 1 10035200 0 0 16777216 2 52510720 0 0 134217728 1 264601600 0 0
(In reply to Jeff Roberson from comment #11) db> show page vm_cnt.v_free_count: 201160 vm_cnt.v_inactive_count: 0 vm_cnt.v_active_count: 0 vm_cnt.v_laundry_count: 0 vm_cnt.v_wire_count: 4563 vm_cnt.v_free_reserved: 333 vm_cnt.v_free_min: 1360 vm_cnt.v_free_target: 4441
Can you confirm that you have r326664?
Is there a way I can get remote access to a serial console on this board for interactive debugging? Or can I buy one cheaply somewhere?
(In reply to Jeff Roberson from comment #5) I have confirmed: db> x/i 0xc052d9f8 uma_reclaim_locked+0x200: mrc p15, 0, r0, c13, c0, 4 But near by addresses show line numbers in blank/comment lines before the while loop for: /* * Drain the cached buckets from a zone. Expects a locked zone on entry. */ static void bucket_cache_drain(uma_zone_t zone) { uma_bucket_t bucket; /* * Drain the bucket queues and free the buckets, we just keep two per * cpu (alloc/free). */ while ((bucket = LIST_FIRST(&zone->uz_buckets)) != NULL) { LIST_REMOVE(bucket, ub_link); ZONE_UNLOCK(zone); bucket_drain(zone, bucket); bucket_free(zone, bucket, NULL); ZONE_LOCK(zone); } /* * Shrink further bucket sizes. Price of single zone lock collision * is probably lower then price of global cache drain. */ if (zone->uz_count > zone->uz_count_min) zone->uz_count--; } in /usr/src_bisect/sys/vm/uma_core.c . So for reference to see if the near by code makes sense: there is a +0x274: bl uma_zfree_arg See below. db> x/i,20 uma_reclaim_locked+0x1d8: beq uma_reclaim_locked+0x4c8 uma_reclaim_locked+0x1dc: mrc p15, 0, r0, c13, c0, 4 uma_reclaim_locked+0x1e0: mov r1, #0x00000000 uma_reclaim_locked+0x1e4: mov r2, r10 uma_reclaim_locked+0x1e8: und e3003317 uma_reclaim_locked+0x1ec: bl _thread_lock uma_reclaim_locked+0x1f0: mrc p15, 0, r0, c13, c0, 4 uma_reclaim_locked+0x1f4: mov r1, r4 uma_reclaim_locked+0x1f8: str r4, [r13, #0x004] uma_reclaim_locked+0x1fc: bl sched_bind uma_reclaim_locked+0x200: mrc p15, 0, r0, c13, c0, 4 uma_reclaim_locked+0x204: mov r1, #0x00000000 uma_reclaim_locked+0x208: mov r2, r10 uma_reclaim_locked+0x20c: und e3003319 uma_reclaim_locked+0x210: ldr r0, [r0] uma_reclaim_locked+0x214: add r0, r0, #0x00000010 uma_reclaim_locked+0x218: bl __mtx_unlock_spin_flags uma_reclaim_locked+0x21c: mov r0, r9 uma_reclaim_locked+0x220: mov r1, r10 uma_reclaim_locked+0x224: und e30026cd uma_reclaim_locked+0x228: bl __rw_rlock uma_reclaim_locked+0x22c: und e3090cb8 uma_reclaim_locked+0x230: cmp r12, #0x0000009a uma_reclaim_locked+0x234: ldr r0, [r0] uma_reclaim_locked+0x238: cmp r0, #0x00000000 uma_reclaim_locked+0x23c: bne uma_reclaim_locked+0x254 uma_reclaim_locked+0x240: b uma_reclaim_locked+0x4b0 uma_reclaim_locked+0x244: ldr r0, [r13, #0x008] uma_reclaim_locked+0x248: ldr r0, [r0, #0x0a4] uma_reclaim_locked+0x24c: cmp r0, #0x00000000 uma_reclaim_locked+0x250: beq uma_reclaim_locked+0x4b0 uma_reclaim_locked+0x254: str r0, [r13, #0x008] db> x/i,20 uma_reclaim_locked+0x258: ldr r7, [r0, #0x04c] uma_reclaim_locked+0x25c: cmp r7, #0x00000000 uma_reclaim_locked+0x260: bne uma_reclaim_locked+0x2a8 uma_reclaim_locked+0x264: b uma_reclaim_locked+0x244 uma_reclaim_locked+0x268: sub r0, r0, #0x00000010 uma_reclaim_locked+0x26c: ldr r0, [r0] uma_reclaim_locked+0x270: mov r1, r9 uma_reclaim_locked+0x274: bl uma_zfree_arg uma_reclaim_locked+0x278: b uma_reclaim_locked+0x29c uma_reclaim_locked+0x27c: und e30f0df4 uma_reclaim_locked+0x280: cmp r12, #0x0000006d uma_reclaim_locked+0x284: bl kassert_panic uma_reclaim_locked+0x288: b uma_reclaim_locked+0x400 uma_reclaim_locked+0x28c: und e30f0df4 uma_reclaim_locked+0x290: cmp r12, #0x0000006d uma_reclaim_locked+0x294: bl kassert_panic uma_reclaim_locked+0x298: b uma_reclaim_locked+0x468 uma_reclaim_locked+0x29c: ldr r7, [r7, #0x048] uma_reclaim_locked+0x2a0: cmp r7, #0x00000000 uma_reclaim_locked+0x2a4: beq uma_reclaim_locked+0x244 uma_reclaim_locked+0x2a8: ldrb r0, [r7, #0x087] uma_reclaim_locked+0x2ac: tst r0, #0x00000020 uma_reclaim_locked+0x2b0: bne uma_reclaim_locked+0x29c uma_reclaim_locked+0x2b4: ldr r0, [r7, #0x040] uma_reclaim_locked+0x2b8: mov r1, #0x00000000 uma_reclaim_locked+0x2bc: mov r2, r10 uma_reclaim_locked+0x2c0: und e30032e7 uma_reclaim_locked+0x2c4: add r0, r0, #0x00000010 uma_reclaim_locked+0x2c8: mov r9, #0x00000000 uma_reclaim_locked+0x2cc: bl __mtx_lock_flags uma_reclaim_locked+0x2d0: bl critical_enter uma_reclaim_locked+0x2d4: mrc p15, 0, r0, c0, c0, 5
(In reply to Jeff Roberson from comment #14) This is running -r326347 directly, not later. It is a debug kernel build (unusual for me). If needed I can update to some specific version of the kernel after I build it. Do you want -r326664 on my end of things? It is built from my sources that are different mostly by powerpc64 and powerpc experiments. My kernel configurations include the standard one and then adjust the debug vs. not details. build with -mcpu=cortex-a7 in use. # svnlite status /usr/src_bisect/ | sort ? /usr/src_bisect/sys/amd64/conf/GENERIC-DBG ? /usr/src_bisect/sys/amd64/conf/GENERIC-NODBG ? /usr/src_bisect/sys/arm/conf/GENERIC-DBG ? /usr/src_bisect/sys/arm/conf/GENERIC-NODBG ? /usr/src_bisect/sys/arm64/conf/GENERIC-DBG ? /usr/src_bisect/sys/arm64/conf/GENERIC-NODBG ? /usr/src_bisect/sys/powerpc/conf/GENERIC64vtsc-DBG ? /usr/src_bisect/sys/powerpc/conf/GENERIC64vtsc-NODBG ? /usr/src_bisect/sys/powerpc/conf/GENERICvtsc-DBG ? /usr/src_bisect/sys/powerpc/conf/GENERICvtsc-NODBG M /usr/src_bisect/contrib/llvm/lib/Target/PowerPC/PPCFrameLowering.cpp M /usr/src_bisect/contrib/llvm/tools/lld/ELF/Arch/PPC64.cpp M /usr/src_bisect/crypto/openssl/crypto/armcap.c M /usr/src_bisect/lib/libkvm/kvm_powerpc.c M /usr/src_bisect/lib/libkvm/kvm_private.c M /usr/src_bisect/stand/defs.mk M /usr/src_bisect/stand/powerpc/boot1.chrp/Makefile M /usr/src_bisect/stand/powerpc/kboot/Makefile M /usr/src_bisect/sys/arm64/arm64/identcpu.c M /usr/src_bisect/sys/conf/kmod.mk M /usr/src_bisect/sys/conf/ldscript.powerpc M /usr/src_bisect/sys/kern/subr_pcpu.c M /usr/src_bisect/sys/powerpc/aim/mmu_oea64.c M /usr/src_bisect/sys/powerpc/ofw/ofw_machdep.c M /usr/src_bisect/sys/powerpc/powerpc/interrupt.c M /usr/src_bisect/sys/powerpc/powerpc/mp_machdep.c M /usr/src_bisect/sys/powerpc/powerpc/trap.c /usr/src_bisect/sys/kern/subr_pcpu.c just has a comment about pcpu_find in it (powerpc tied content).
(In reply to Jeff Roberson from comment #15) No remote access to the serial console, sorry. Cortext-A7 (armv7) based rpi2's (so, V1.1 or before) are no longer in production. [V1.2 changed to Cortex-A53 (aarch64).] The one I'm using has its slot for mmcsd0 with a broken "hold the sdcard in" mechanism but an eject mechanism that still works. I actively hold the sdcard in until it is no longer needed (has switched to the USB SSD on the powered hub). I tried to get my hands on an advertised Cortex-A7 RPI2-B but what was delivered was a V1.2 (Cortex-A53) one. This may be typical these days. So I'm still stuck holding the sdcard in during early power-up. However, based on Emmanuel Vadot's pine64 report I expect that lots of other types of devices also show the problem for USB drives. So far I've avoided bringing forward the problem to any other types of device. Out of (from snapshots): o 12.0-CURRENT arm GUMSTIX o 12.0-CURRENT armv6 RPI-B o 12.0-CURRENT armv7 BEAGLEBONE o 12.0-CURRENT armv7 CUBIEBOARD o 12.0-CURRENT armv7 CUBIEBOARD2 o 12.0-CURRENT armv7 CUBOX-HUMMINGBOARD o 12.0-CURRENT armv7 RPI2 o 12.0-CURRENT armv7 PANDABOARD o 12.0-CURRENT armv7 WANDBOARD o 12.0-CURRENT aarch64 GENERIC o 12.0-CURRENT aarch64 RPI3 o 12.0-CURRENT aarch64 PINE64 I have access to one each of the last 2 types as well and could build for them if needed. I also have that RPI2 V1.2 that was not what was advertised, which may well be covered by the RPI3 type of build. You might want to talk to Emmanuel Vadot since he is a committer and works on supporting these types of boards. He knows vastly more than I do. He may have access to more board types as well. RPI3's are fairly cheap and are easy to find/order. Pine64+ 2GB's have more RAM and are faster but not as common. There are Pine64's with less RAM and more modern variants that Emmanuel has provided support for as I understand (lpddr3 memory instead, for example). [The banana Pi M3 (Cortex-A7) that I have access to seems to have a power-connector problem so I've not mentioned it above. Still I could try dealing with it if needed.]
(In reply to Mark Millard from comment #17) Actually I need to be clearer about versions: kernel: -r326347 world: -r326726 This started for me when I jumped from -r326192 to -r326726. I held world constant and bisect'd just the kernel, using a separate source tree. So my world is from my /usr/src/ and my kernel is from my /usr/src-bisect/ . I could make both world and kernel be based on a specific version if you want --or just the kernel as I have been doing. Also present are materials from: # ls -lTd /usr/local/share/rpi-firmware /usr/local/share/u-boot/u-boot-rpi2 drwxr-xr-x 2 root wheel 512 Nov 22 17:22:52 2017 /usr/local/share/rpi-firmware drwxr-xr-x 2 root wheel 512 Nov 22 17:22:52 2017 /usr/local/share/u-boot/u-boot-rpi2 from sysutils/rpi-firmware and sysutils/u-boot-rpi2 .
Please pull in the patch from r326664 and retest. It is safe to pull just that rev.
(In reply to Mark Millard from comment #16) Why the source lines were odd. . . Stupid /usr/src_bisect/ vintage mistake on my part: it was at -r326346 instead of -r326347 (from something prior that I'd looked at). Updating /usr/src_bisect back to -r326347 leads to the addr2line based lookup making sense. . . # addr2line -e /usr/obj/DESTDIRs/clang-armv7dbg-installkernel_bisect/usr/lib/debug/boot/kernel/kernel.debug 0xc052d9f4 /usr/src_bisect/sys/vm/uma_core.c:792 line 792 is the sched_bind(curthread, cpu) line in: static void cache_drain_safe(uma_zone_t zone) { int cpu; /* * Polite bucket sizes shrinking was not enouth, shrink aggressively. */ if (zone) cache_shrink(zone); else zone_foreach(cache_shrink); CPU_FOREACH(cpu) { thread_lock(curthread); sched_bind(curthread, cpu); thread_unlock(curthread); if (zone) cache_drain_safe_cpu(zone); else zone_foreach(cache_drain_safe_cpu); } thread_lock(curthread); sched_unbind(curthread); thread_unlock(curthread); }
(In reply to Jeff Roberson from comment #20) I'll try that patch but remember that I started the bisect because jumping to -r326726 had the problem. (I jumped from -r326192.) So I started with this patch present when I started my bisect effort.
Thank you Mark, that makes more sense. It looks like we bound to some cpu but never ran again. Does arm do anything strange with its cpus? are they 0..mp_maxid? Anything odd with scheduling?
(In reply to Jeff Roberson from comment #20) Interesting. I had a successful boot of -r326347 with that patch present. I'll spend some time testing if it seems to reliably boot --after getting something to eat. So I may have to retry -r326726 and bisect again if -r326347 with the patch seems to reliably work but -r326726 does not seem to reliably work. (We will see.)
(In reply to Jeff Roberson from comment #23) I'm afraid I do not have the right background knowledge for questions like "are they 0..mp_maxid? Anything odd with scheduling?" for the cores. (Sometimes I can analyze source enough to find answers to specific questions --other times not. Such tends to be time consuming relative to where I start from.)
Sorry, the output of: sysctl kern.smp sysctl kern.sched should tell me what I need to know. Thanks!
(In reply to Jeff Roberson from comment #26) The following is from the rpi2 booted with the patched -r326347: # sysctl kern.smp kern.smp.forward_signal_enabled: 1 kern.smp.topology: 0 kern.smp.cpus: 4 kern.smp.disabled: 0 kern.smp.active: 1 kern.smp.maxcpus: 4 kern.smp.maxid: 3 # sysctl kern.sched kern.sched.topology_spec: <groups> <group level="1" cache-level="2"> <cpu count="4" mask="f">0, 1, 2, 3</cpu> </group> </groups> kern.sched.steal_thresh: 2 kern.sched.steal_idle: 1 kern.sched.balance_interval: 127 kern.sched.balance: 1 kern.sched.affinity: 1 kern.sched.idlespinthresh: 157 kern.sched.idlespins: 10000 kern.sched.static_boost: 152 kern.sched.preempt_thresh: 80 kern.sched.interact: 30 kern.sched.slice: 12 kern.sched.quantum: 94488 kern.sched.name: ULE kern.sched.preemption: 1 kern.sched.cpusetsize: 4
So far the patched -r326347 has booted fine, both debug-kernel and non-debug-kernel builds. I'm now building -r326664 in hopes that it will be a good low bound if I have to start another bisect when I retry -r326726. For various reasons this is a full build. I buildworld and buildkernel even when I only install a trial kernel for a bisect. So, it will be a while. (It is a cross build, however.) I'm doing non-debug, in other words, the way I usually do things.
When I looked I found at least one change between -r326664 and -r326726 that had later been reverted. So I updated to -r326871 instead of testing -r326726 : # uname -apKU FreeBSD rpi2 12.0-CURRENT FreeBSD 12.0-CURRENT r326871M arm armv7 1200054 1200054 It seems to be booting from the USB SSD just fine: no hang-ups so far.
(In reply to Mark Millard from comment #29) I have progressed the rpi2 V1.1 to -r32888 and it is still booting and operating fine. See: https://lists.freebsd.org/pipermail/freebsd-arm/2017-December/017195.html for the details of how the boot is set up to get the kernel and world from the USB SSD, getting very little from the sdcard first.
On Pine64-LTS running c4bc9a29b59/r326935 it's still hanging when plugging an usb device db> print *uma_kmem_limit 29400000 db> print *uma_kmem_total 1813000 db> print *uma_reclaim_needed 0 db> show page vm_cnt.v_free_count: 487944 vm_cnt.v_inactive_count: 269 vm_cnt.v_active_count: 2995 vm_cnt.v_laundry_count: 0 vm_cnt.v_wire_count: 15673 vm_cnt.v_free_reserved: 726 vm_cnt.v_free_min: 3259 vm_cnt.v_free_target: 10858 vm_cnt.v_inactive_target: 16287 db> show vmem *kernel_arena vmem 0xffff000000a2ec00 'kernel arena' quantum: 4096 size: 413138944 inuse: 411766784 free: 1372160 busy tags: 1031 free tags: 4 inuse size free size 4096 541 2215936 3 12288 8192 343 2809856 0 0 16384 8 131072 0 0 20480 108 2211840 0 0 32768 5 163840 0 0 36864 1 36864 0 0 65536 8 524288 0 0 69632 1 69632 0 0 98304 1 98304 0 0 131072 5 720896 0 0 262144 6 1572864 0 0 524288 1 524288 0 0 1048576 1 1187840 1 1359872 4194304 1 7905280 0 0 268435456 1 391593984 0 0 db> c root@pine64-lts:~ # sysctl hw.usb.debug=6 hw.usb.debug: 0 -> 6 root@pine64-lts:~ # usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x000 4 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0x23 bRequest=0x01 wValue=0x0010 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_req_reset_port: usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0x23 bRequest=0x01 wValue=0x0014 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0x23 bRequest=0x03 wValue=0x0004 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0x23 bRequest=0x01 wValue=0x0010 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_req_reset_port: port 1 reset returning error=USB_ERR_NORMAL_COMPLETION usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0x23 bRequest=0x01 wValue=0x0010 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_req_reset_port: usbd_req_reset_port: usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0x23 bRequest=0x01 wValue=0x0014 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0x23 bRequest=0x03 wValue=0x0004 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0x23 bRequest=0x01 wValue=0x0014 wIndex=0x0001 wLength=0x0000 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_do_request_flags: udev=0xfffffd0000b4e000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usbd_req_reset_port: port 1 reset returning error=USB_ERR_NORMAL_COMPLETION usbd_do_request_flags: udev=0xfffffd0000b4f000 bmRequestType=0xa3 bRequest=0x00 wValue=0x0000 wIndex=0x0001 wLength=0x0004 usbd_do_request_flags: Handle Request function is set usb_alloc_device: parent_dev=0xfffffd00008d4900, bus=0xffff00005aef7320, parent_hub=0xfffffd0000b4f000, depth=1, port_index=0, port_no=1, spe ed=1, usb_mode=0 usb_alloc_device: device_index=2 usb_set_device_state: udev 0xfffffd0043231000 state DETACHED -> POWERED usb_set_device_state: done usb_alloc_device: calling usbd_req_set_address usbd_req_set_address: setting device address=2 usbd_do_request_flags: udev=0xfffffd0043231000 bmRequestType=0x00 bRequest=0x05 wValue=0x0002 wIndex=0x0000 wLength=0x0000 usbd_do_request_flags: calling usbd_ctrl_transfert_setup usbd_ctrl_transfer_setup: unsetup existing transfer usbd_ctrl_transfer_setup: Setup new transfer usbd_transfer_setup: Do stuff usbd_transfer_setup: Starting loop usbd_transfer_setup: Get matching endpoint usbd_transfer_setup: Done usbd_transfer_setup: set transfer endpoint pointer usbd_transfer_setup: Call host or device ctrl transfer setup usbd_transfer_setup: Done usbd_transfer_setup: Get matching endpoint usbd_transfer_setup: Done usbd_transfer_setup: set transfer endpoint pointer usbd_transfer_setup: Call host or device ctrl transfer setup usbd_transfer_setup: Done usbd_transfer_setup: Get matching endpoint usbd_transfer_setup: Done usbd_transfer_setup: set transfer endpoint pointer usbd_transfer_setup: Call host or device ctrl transfer setup Then it's just hanged and I can't enter ddb again. The patch for the extra debug printf is here : https://people.freebsd.org/~manu/0001-Add-a-lot-of-debug-printf.patch
(In reply to Jeff Roberson from comment #26) FYI: That same rpi2 with the same -r326888 head version just hung up at boot the same way as the earlier versions had. It suggests some sort of race condition or other source of variability. chain 34: thread 100001 (pid 1, kernel) blocked on sx "umadrain" XLOCK thread 100072 (pid 23, uma) is on a run queue db> bt Tracing pid 23 tid 100072 td 0xd6dd3ae0 cpu_switch() at cpu_switch+0x10 pc = 0xc05b8be8 lr = 0xc02af88c (sched_switch+0x7e4) sp = 0xd7dbdd48 fp = 0xd7dbdd78 sched_switch() at sched_switch+0x7e4 pc = 0xc02af88c lr = 0xc0288b40 (mi_switch+0x1ac) sp = 0xd7dbdd80 fp = 0xd7dbdda8 r4 = 0x00000034 r5 = 0x00000000 r6 = 0xd6dd3ae0 r7 = 0x00000000 r8 = 0x00000000 r9 = 0xc0816768 r10 = 0x00000100 mi_switch() at mi_switch+0x1ac pc = 0xc0288b40 lr = 0xc05609c0 (uma_reclaim_locked+0x258) sp = 0xd7dbddb0 fp = 0xd7dbddd8 r4 = 0xc0833600 r5 = 0x00000002 r6 = 0x00000000 r7 = 0x00000000 r8 = 0xc0834168 r9 = 0x00000000 r10 = 0x00000000 uma_reclaim_locked() at uma_reclaim_locked+0x258 pc = 0xc05609c0 lr = 0xc0560f4c (uma_reclaim_worker+0xd8) sp = 0xd7dbdde0 fp = 0xd7dbde18 r4 = 0xc095e528 r5 = 0xc095e538 r6 = 0xc09685c8 r7 = 0x00000000 r8 = 0xc0834168 r9 = 0xc095e53c r10 = 0x00000000 uma_reclaim_worker() at uma_reclaim_worker+0xd8 pc = 0xc0560f4c lr = 0xc0240144 (fork_exit+0xc0) sp = 0xd7dbde20 fp = 0xd7dbde40 r4 = 0xd6dd3ae0 r5 = 0xd6b4b000 r6 = 0xc0560e74 r7 = 0x00000000 r8 = 0xd7dbde48 r9 = 0xc09518c0 r10 = 0xd6b5b000 fork_exit() at fork_exit+0xc0 pc = 0xc0240144 lr = 0xc0599c58 (swi_exit) sp = 0xd7dbde48 fp = 0x00000000 r4 = 0xc0560e74 r5 = 0x00000000 r6 = 0xc1ed63a0 r7 = 0x00002738 r8 = 0xc0816768 r10 = 0xd6b5b000 swi_exit() at swi_exit pc = 0xc0599c58 lr = 0xc0599c58 (swi_exit) sp = 0xd7dbde48 fp = 0x00000000 lr = 0xc05609c0 (uma_reclaim_locked+0x258) ends up being /usr/src_bisect/sys/vm/uma_core.c:792 (like before). db> bt Tracing pid 1 tid 100001 td 0xc3770000 cpu_switch() at cpu_switch+0x10 pc = 0xc05b8be8 lr = 0xc02af88c (sched_switch+0x7e4) sp = 0xc3776718 fp = 0xc3776748 sched_switch() at sched_switch+0x7e4 pc = 0xc02af88c lr = 0xc0288b40 (mi_switch+0x1ac) sp = 0xc3776750 fp = 0xc3776778 r4 = 0x000004c5 r5 = 0x00000000 r6 = 0xc3770000 r7 = 0x00000000 r8 = 0x00000000 r9 = 0xc0816768 r10 = 0x00000104 mi_switch() at mi_switch+0x1ac pc = 0xc0288b40 lr = 0xc02d7964 (sleepq_wait+0x2c) sp = 0xc3776780 fp = 0xc3776798 r4 = 0xc3770000 r5 = 0x00000000 r6 = 0xc095e528 r7 = 0xc095e538 r8 = 0x00000001 r9 = 0x00000001 r10 = 0xc095e528 sleepq_wait() at sleepq_wait+0x2c pc = 0xc02d7964 lr = 0xc028789c (_sx_slock_hard+0x308) sp = 0xc37767a0 fp = 0xc37767f8 r4 = 0x00000001 r5 = 0x00000000 r6 = 0x00000000 r10 = 0xc095e528 _sx_slock_hard() at _sx_slock_hard+0x308 pc = 0xc028789c lr = 0xc055dbe8 (uma_zcreate+0xa4) sp = 0xc3776800 fp = 0xc3776840 r4 = 0x00000000 r5 = 0xc095e528 r6 = 0x00000078 r7 = 0xc06f8205 r8 = 0xd6f32048 r9 = 0xc3770000 r10 = 0xd6f32000 uma_zcreate() at uma_zcreate+0xa4 pc = 0xc055dbe8 lr = 0xc053b8c0 (ffs_mount+0x80) sp = 0xc3776848 fp = 0xc3776978 r4 = 0xd6f32000 r5 = 0x00000000 r6 = 0x00000003 r7 = 0xc095e2a8 ffs_mount() at ffs_mount+0x80 pc = 0xc053b8c0 lr = 0xc0344fbc (vfs_donmount+0x1490) sp = 0xc3776980 fp = 0xc3776b38 r4 = 0xffffffff r5 = 0xc0812870 r6 = 0xc419afb0 r7 = 0x00000000 r8 = 0xd6f32048 r9 = 0x00000000 r10 = 0xd6f32000 vfs_donmount() at vfs_donmount+0x1490 pc = 0xc0344fbc lr = 0xc03481c8 (kernel_mount+0x3c) sp = 0xc3776b40 fp = 0xc3776b78 r4 = 0xd7826000 r5 = 0x00000000 r6 = 0xd7826000 r7 = 0x00000000 r8 = 0xd6f2d010 r9 = 0xc418ffb0 r10 = 0x00000000 kernel_mount() at kernel_mount+0x3c pc = 0xc03481c8 lr = 0xc034ad64 (parse_mount+0x458) sp = 0xc3776b80 fp = 0xc3776c58 r4 = 0xd784c100 r5 = 0xd6f2d014 r6 = 0xd7826000 r10 = 0x00000000 parse_mount() at parse_mount+0x458 pc = 0xc034ad64 lr = 0xc034950c ($a.2+0x24) sp = 0xc3776c60 fp = 0xc3776dc0 r4 = 0xd6f33018 r5 = 0xfffffff7 r6 = 0xc3770000 r7 = 0xc41a67da r8 = 0xd6f2c080 r9 = 0x00000000 r10 = 0xd6f33000 $a.2() at $a.2+0x24 pc = 0xc034950c lr = 0xc02168bc (start_init+0x60) sp = 0xc3776dc8 fp = 0xc3776e18 r4 = 0xc3770000 r5 = 0xc376c000 r6 = 0xc021685c r7 = 0x00000001 r8 = 0xc376c000 r9 = 0x00000000 r10 = 0x00000000 start_init() at start_init+0x60 pc = 0xc02168bc lr = 0xc0240144 (fork_exit+0xc0) sp = 0xc3776e20 fp = 0xc3776e40 r4 = 0xc3770000 r5 = 0xc376c000 r6 = 0xc021685c r7 = 0x00000000 r8 = 0xc3776e48 r9 = 0x00000000 r10 = 0x00000000 fork_exit() at fork_exit+0xc0 pc = 0xc0240144 lr = 0xc0599c58 (swi_exit) sp = 0xc3776e48 fp = 0x00000000 r4 = 0xc021685c r5 = 0x00000000 r6 = 0x00000000 r7 = 0x00000000 r8 = 0x00000000 r10 = 0x00000000 swi_exit() at swi_exit pc = 0xc0599c58 lr = 0xc0599c58 (swi_exit) sp = 0xc3776e48 fp = 0x00000000 db> print *uma_kmem_limit 107f2000 db> print *uma_kmem_total 8d8000 db> print *uma_reclaim_needed 7 db> show page vm_cnt.v_free_count: 199675 vm_cnt.v_inactive_count: 0 vm_cnt.v_active_count: 0 vm_cnt.v_laundry_count: 0 vm_cnt.v_wire_count: 3035 vm_cnt.v_free_reserved: 329 vm_cnt.v_free_min: 1341 vm_cnt.v_free_target: 4377 vm_cnt.v_inactive_target: 6565 db> show vmem *kernel_arena vmem 0xc0958c80 'kernel arena' quantum: 4096 size: 352321536 inuse: 351490048 free: 831488 busy tags: 1143 free tags: 3 inuse size free size 4096 985 4034560 0 0 8192 11 90112 0 0 12288 84 1032192 0 0 16384 14 229376 0 0 20480 2 40960 0 0 32768 5 163840 0 0 36864 1 36864 0 0 65536 11 720896 0 0 94208 1 94208 0 0 131072 19 2990080 1 184320 262144 1 270336 0 0 524288 2 1708032 1 647168 2097152 1 2097152 0 0 4194304 2 12357632 0 0 8388608 1 9773056 0 0 16777216 2 52510720 0 0 134217728 1 263340032 0 0
(In reply to Mark Millard from comment #32) Can you set uma_kmem_limit = ULONG_MAX; where it is declared in uma_core.c and test a theory?
(In reply to Jeff Roberson from comment #33) [Note: The problem was reproduced with a non-debug kernel build and this is also such. I normally do not run debug builds.] For reference: /* kmem soft limit. */ static unsigned long uma_kmem_limit = ULONG_MAX; static volatile unsigned long uma_kmem_total; It still hung up. But it does look a little different: chain 36: thread 100000 (pid 0, swapper) blocked on lock 0xc0967474 (sleep mutex) "process lock" thread 100081 (pid 0, if_io_tqg_3) is on a run queue if_io_tqg_3 is new. But the allchain output eventually got: chain 45: thread 100056 (pid 0, CAM taskq) sleeping on 0xc3764680 "-" chain 46: thread 100073 (pid 0, if_config_tqg_0) blocked on lockmgr (null)EXCL KDB: reentering KDB: stack backtrace: db_trace_self() at db_trace_self pc = 0xc05972cc lr = 0xc0060f98 (db_trace_self_wrapper+0x30) . . . exception_exit() at exception_exit pc = 0xc0599cc8 lr = 0xc0258684 (lockmgr_chain+0x60) sp = 0xc3783858 fp = 0xc3783880 r0 = 0x75657571 r1 = 0xc06cbd2d r2 = 0x00000000 r3 = 0x600001d3 r4 = 0xc06cbd2d r5 = 0xc06ddb13 r6 = 0xc0964bf8 r7 = 0xc415b220 r8 = 0xc06ddb69 r9 = 0xc3783860 r10 = 0xc07fa250 r12 = 0x00000003 print_lockchain() at print_lockchain+0x58 pc = 0xc02e1898 lr = 0xc02e1ad0 (db_show_allchains+0x9c) sp = 0xc3783888 fp = 0xc37838a0 r4 = 0x0000002e r5 = 0xc09673c8 r6 = 0xc0964bf8 r7 = 0xd6dd3740 r8 = 0xc06ddbb2 r9 = 0xc06cbd2d r10 = 0x00000000 db_show_allchains() at db_show_allchains+0x9c pc = 0xc02e1ad0 lr = 0xc005d934 (db_command+0x268) sp = 0xc37838a8 fp = 0xc3783948 r4 = 0x00000001 r5 = 0x00000000 r6 = 0xc06f7397 r7 = 0xc02e1a34 r8 = 0xc0964b74 r9 = 0xc07fb448 . . .
(In reply to Mark Millard from comment #34) In: QUOTE It still hung up. But it does look a little different: chain 36: thread 100000 (pid 0, swapper) blocked on lock 0xc0967474 (sleep mutex) "process lock" thread 100081 (pid 0, if_io_tqg_3) is on a run queue if_io_tqg_3 is new. END QUOTE. I also should have said the swapper is new. And "a little" was silly.
(In reply to Mark Millard from comment #35) I'm sorry, I was a little distracted and missed sending the part that was similar to the past reports: chain 35: thread 100001 (pid 1, kernel) blocked on sx "umadrain" XLOCK thread 100072 (pid 23, uma) is on a run queue The chain that I reported in #33 and #34 is in addition to the the above, chains repeated below: chain 36: thread 100000 (pid 0, swapper) blocked on lock 0xc0967474 (sleep mutex) "process lock" thread 100081 (pid 0, if_io_tqg_3) is on a run queue . . . chain 46: thread 100073 (pid 0, if_config_tqg_0) blocked on lockmgr (null)EXCL KDB: reentering KDB: stack backtrace: . . .
In uma_reclaim_wakeup() can you add: printf("limit %lX, total %lX, needed %d\n", uma_kmem_limit, uma_kmem_total, uma_reclaim_needed); I'm not sure why you're hitting the limit at all during boot.
(In reply to Jeff Roberson from comment #37) Is the ULONG_MAX still required? It did not look like I'd be likely to ever boot with it. Note: I reverted and tried some more and it booted the last time that I tried. So it is currently at a sh prompt. Racy or dependent on uninitialized memory that is not stable in value or some such?
I think the ulong makes sense. There are possibly a few allocations that happen before the value is initialized and these could trigger wakeups. Can you add the printf and show me the output from it?
(In reply to Jeff Roberson from comment #39) With ULONG_MAX it died in a state that could not even complete a show allchains at the db> prompt. This might interfere with gathering other information. For reference: /* kmem soft limit. */ static unsigned long uma_kmem_limit = ULONG_MAX; static volatile unsigned long uma_kmem_total; . . . void uma_reclaim_wakeup(void) { printf("limit %lX, total %lX, needed %d\n", uma_kmem_limit, uma_kmem_total, uma_reclaim_needed); if (atomic_fetchadd_int(&uma_reclaim_needed, 1) == 0) wakeup(uma_reclaim); } I'll reply when I've got it in place.
ah it should be LONG_MAX not ULONG_MAX.
(In reply to Mark Millard from comment #40) Successful boots do not show the printf output (so far). A bunch of attempts have not gotten a boot hangup. It looks like I'll just have to use it with the two changes and hope for a failure with some output from the new printf at some point. I have no clue how to force a failure on demand. I'll note that after the -r326871 on the 15th (comment #29) the rpi2 spent most of its time doing poudriere build experiments while I did other things. So it has not been rebooted all that much since then.
(In reply to Jeff Roberson from comment #41) Okay. I'll update to LONG_MAX. But my guess is that it will still be a "use it until failure" sort of context with no quick result.
(In reply to Mark Millard from comment #43) Looks like I installed the wrong kernel initially, no printf available to output. So this is the first test with the printf in place. . . Still no failures. Looks like "use until a failure shows up". For reference: /* kmem soft limit. */ static unsigned long uma_kmem_limit = LONG_MAX; static volatile unsigned long uma_kmem_total; . . . void uma_reclaim_wakeup(void) { printf("limit %lX, total %lX, needed %d\n", uma_kmem_limit, uma_kmem_total, uma_reclaim_needed); if (atomic_fetchadd_int(&uma_reclaim_needed, 1) == 0) wakeup(uma_reclaim); }
What it looks like to me is that we woke up the reclaim thread multiple times early in boot before arm was prepared to run the thread and bind to multiple CPUs. There may be a race that makes it work sometimes. I believe the = LONG_MAX fix should resolve the issue.
(In reply to Jeff Roberson from comment #45) Okay. I'll report here if I see another hang. If I progress past -r326888 I'll be sure that the LONG_MAX is still present --and for now I'll leave the printf as well. Thanks.
The LONG_MAX doesn't change anything for me on Pine64.
(In reply to Emmanuel Vadot from comment #47) I have replicated your hangs-on-plugin-of- USB-drive problem on a Pine64+ 2GB. The printf did not output. I first updated the Pine64+ 2GB that I have access to be based on the patched -r326888 based source that I've been using for my recent rpi2 experiments. It boots via the usdcard slot, not USB. (eMMC via an adapter.) Plugging in a USB drive (via plugging the a USB hub it is on) hangs the Pine64+ 2GB. <CR>~^B does not get to the db> prompt. This was with a non-debug kernel.
After debugging this further on IRC we believe this may be an existing pmap race that is made worse by combining kernel/kmem arenas. Disabling SMP makes the problem go away. Adding printfs makes it clear that it hangs somewhere in vm_page_alloc_contig, but printing in this function changes timing sufficiently that the problem goes away. JHB noticed that there are no IPIs in pmap_change_attr on arm. Could this be a problem with the direct map?
(In reply to Mark Millard from comment #48) I tried plugging in just the hub, without the drive. That worked. On the console serial port it shows the following when the hub is plugged in: ugen0.2: <vendor 0x05e3 product 0x0610> at usbus0 uhub2 on uhub0 uhub2: <vendor 0x05e3 product 0x0610, class 9/0, rev 2.00/92.10, addr 2> on usbus0 uhub2: MTT enabled uhub2: 4 ports with 1 removable, self powered And the following when unplugged: ugen0.2: <vendor 0x05e3 product 0x0610> at usbus0 (disconnected) uhub2: at uhub0, port 1, addr 2 (disconnected) uhub2: detached
(In reply to Mark Millard from comment #50) I have no such problem with plugging in USB drives on the rpi3 that is running -r326888 (unpatched as yet). For example: ugen0.4: <vendor 0x05e3 product 0x0610> at usbus0 uhub2 on uhub1 uhub2: <vendor 0x05e3 product 0x0610, class 9/0, rev 2.00/92.10, addr 4> on usbus0 uhub2: MTT enabled uhub2: 4 ports with 1 removable, self powered ugen0.5: <OWC Envoy Pro mini> at usbus0 umass0 on uhub2 umass0: <OWC Envoy Pro mini, class 0/0, rev 2.10/1.00, addr 5> on usbus0 umass0: SCSI over Bulk-Only; quirks = 0x0100 umass0:0:0: Attached to scbus0 da0 at umass-sim0 bus 0 scbus0 target 0 lun 0 da0: <OWC Envoy Pro mini 0> Fixed Direct Access SPC-4 SCSI device da0: Serial Number <REPLACED> da0: 40.000MB/s transfers da0: 228936MB (468862128 512 byte sectors) da0: quirks=0x2<NO_6_BYTE> (Same hub and drive as I used on the Pine64+ 2GB.)
(In reply to Jeff Roberson from comment #45) I've tried building -r327075 for various TARGET_ARCH's and they fail to compile with LONG_MAX in place: TARGET_ARCH=powerpc (via gcc 4.2.1, llvm fail to build it normally) --- uma_core.o --- Compiler executable checksum: 9060ea2e8b86c7caeab9ffe5a850239f /usr/src/sys/vm/uma_core.c:151: error: 'LONG_MAX' undeclared here (not in a function) *** [uma_core.o] Error code 1 make[2]: stopped in /usr/obj/powerpcvtsc_clang_gcc421/powerpc.powerpc/usr/src/powerpc.powerpc/sys/GENERICvtsc-NODBG TARGET_ARCH=powerpc64 (via system-clang) --- uma_core.o --- static unsigned long uma_kmem_limit = LONG_MAX; ^ --- uma_core.o --- /usr/src/sys/vm/uma_core.c:151:39: error: use of undeclared identifier 'LONG_MAX' --- uma_core.o --- 1 error generated. And so on. Apparently no header that defines it has been included.
A commit references this bug: Author: jeff Date: Tue Jan 2 04:35:57 UTC 2018 New revision: 327485 URL: https://svnweb.freebsd.org/changeset/base/327485 Log: Fix arc after r326347 broke various memory limit queries. Use UMA features rather than kmem arena size to determine available memory. Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before the real limit is set. PR: 224330 (partial), 224080 Reviewed by: markj, avg Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13494 Changes: head/sys/cddl/compat/opensolaris/sys/kmem.h head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c head/sys/vm/uma.h head/sys/vm/uma_core.c head/sys/vm/uma_int.h
Andy, I am assigning this to you because JHB and I believe it is an existing race in pmap. Changing timing between calls to the allocator and pmap_change_attr() resolves the issue.
May I ask when can we expect a fix for this? I would like to update an rpi3 and a pine64, but this is blocking them from booting on their own. boot -v works this around, however I can't always have a keyboard/console attached to manually aid the booting process.
(In reply to gergely.czuczy from comment #55) So far as I know, at this point no one knows just what source code is wrong. No specific source code fixes have been identified from what I can tell. So it looks to be an open-ended effort at this point: no pre-kniown schedule is likely.
(In reply to gergely.czuczy from comment #55) For what it's worth, setting boot_verbose="YES" in loader.conf(5) will also do the trick and alleviate the need for serial on boot. This is what I do for my Pine64, at least.
That works for my pine64, however that doesn't do the trick for the rpi3. And unfortunately my rpi3 is the brewery controller, where i don't have the opportunity to attach a serial console. So that's why it's a very serious issue for me, I can't update my beer brewing stack because of this bug :)
For reference USB is also non-functional on my arm64 Softiron Overdrive 1000, at r329060.
(In reply to Ed Maste from comment #59) FYI: on a Pine64+ 2GiB running head -r333157 I can plug in a USB drive into the bottom USB port, and mount and use /dev/da0p1: => 34 468862061 da0 GPT (224G) 34 2014 - free - (1.0M) 2048 413138944 1 PINE642Groot (197G) 413140992 12582912 2 PINE642Gswap (6.0G) 425723904 131072 3 Basic data partition (64M) 425854976 43007119 - free - (21G) As I remember the upper USB port was ignored when I tried there. Note: I'm not booting from USB currently. Instead I'm using a EMMC on an adapter to provide /dev/mmcsd0 and I'm booting it from that.
Fixed a while ago
(In reply to Emmanuel Vadot from comment #61) Reconfirmed recently for head -r337400 on a Pine64+ 2GB on the lower USB port. (It had been a very long time since I'd updated/tried.) In this case the root file system and swap partition were on the USB drive, unlike comment #60.
(In reply to Mark Millard from comment #62) No, this is not what this bug is about (whatever you are talking about) The board doesn't hang If you have another problem please fill a new PR with info on how to reproduce.
(In reply to Emmanuel Vadot from comment #63) My wording was unclear: I was trying to report that things were working for the port I happened to use: no hangs. I've recently done over 80 hours of builds (world, kernel, ports) based on USB root file systems and swap partitions, multiple reboots, power downs, and power ups involved. It all worked fine. I happened to use the lower USB port.