Hi,

We were performing file system operations using du, cp, mv, shutil.move/rmtree,
and os.rename. We observed that one of the directories ends up in a bad state
(listing it hangs) on 13.1 and 13.2; the same issue was also reproduced with the
13.0 and 12.4 ISOs. Yet to verify on CURRENT.

rtestuser@freeBSD13_2:~ $ uname -a
FreeBSD freeBSD13_2.cs39 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
rtestuser@freeBSD13_2:~ $ cd /data/infra/
rtestuser@freeBSD13_2:/data/infra $ ps -aux | grep python
root        726  0.0 9.1 23100 8768 v0  I  14:34 14:51.95 python move-move-delete.py (python3.9)
root       3332  0.0 7.3 18800 7028 v0  I  14:35  0:00.03 python du.py (python3.9)
root      23350  0.0 7.3 18800 7028 v0  I  15:07  0:00.02 python du.py (python3.9)
rtestuser 48818  0.0 2.1 12812 1984  0  S+ 15:53  0:00.00 grep python
rtestuser@freeBSD13_2:/data/infra $ ls
createfiles.py  dir1  dir4_ren  du.py  move-move-delete.py
rtestuser@freeBSD13_2:/data/infra $ cat move-move-delete.py
import os
import shutil
import time

path2 = "/data/infra/dir2_ren"
path3 = "/data/infra/dir3_ren"
path4 = "/data/infra/dir4_ren"

while 1:
    os.rename("/data/infra/dir1/", "/data/infra/dir2_ren/")
    os.system("ls -ir /data/infra/")
    print("rename done")
    os.mkdir(path3)
    files = [f for f in os.listdir("/data/infra/dir2_ren/")]
    for file in files:
        old = "/data/infra/dir2_ren/" + file
        ipath = "/data/infra/dir2_ren/" + file + "/"
        files1 = [af for af in os.listdir(ipath)]
        for ff in files1:
            fpath = "/data/infra/dir2_ren/" + file + "/" + ff
            # keeping files opened to check it further
            p = open(fpath, "w")
            p1 = open(fpath, "w")
            print("I AM WRITING")
            p.write("I am writing now!")
        new = "/data/infra/dir3_ren/" + file
        shutil.move(old, new)
    print("shutil.move done")
    shutil.rmtree("/data/infra/dir2_ren")
    print("deleted dir2_ren")
    os.system("ls -ir /data/infra/")
    os.mkdir(path4)
    os.system("mv /data/infra/dir3_ren/* /data/infra/dir4_ren/")
    print("os mv command success")
    shutil.rmtree("/data/infra/dir3_ren")
    print("deleted dir3_ren")
    print("after dir3_ren")
    os.system("ls -ir /data/infra/")
    os.system("mkdir /data/infra/dir1")
    os.system("cp -rf /data/infra/dir4_ren/* /data/infra/dir1/")
    print("dir1 created back and copied data from dir4 to dir1")
    shutil.rmtree("/data/infra/dir4_ren")

rtestuser@freeBSD13_2:/data/infra $ cat du.py
import os
import time

while 1:
    os.system("du /")
    time.sleep(10)

rtestuser@freeBSD13_2:/data/infra $ ls -lrt dir1/
total 0
rtestuser@freeBSD13_2:/data/infra $ ls -lrt dir4_ren/
^C

Note: in the next run, it got stuck in shutil.move.
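One aside on the repro script: each pass of the inner loop opens two handles on the same file and never closes either, so descriptors leak until the process hits its open-file limit. If the intent is only to hold files open across the move, a bounded variant would look like this (just a sketch of the idea, not the original repro; the paths, counts, and handles list are illustrative):

import os
import shutil

# Sketch: hold files open while their parent directory is moved,
# then close them, so descriptors do not leak across iterations.
src, dst = "/tmp/demo_src", "/tmp/demo_dst"
os.makedirs(src, exist_ok=True)
handles = []
for i in range(5):
    f = open(os.path.join(src, "file%d" % i), "w")
    f.write("Hit the issue")
    handles.append(f)        # deliberately kept open across the move
shutil.move(src, dst)        # directory is moved while files are open
for f in handles:
    f.close()                # release the descriptors afterwards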
rtestuser@freeBSD13_2:~ $ uname -r
13.2-RELEASE
rtestuser@freeBSD13_2:~ $ uname -a
FreeBSD freeBSD13_2.cs39 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
rtestuser@freeBSD13_2:~ $ top
last pid: 76719;  load averages: 2.02, 1.48, 1.04   up 0+00:27:10  18:00:02
28 processes: 3 running, 25 sleeping
CPU: 19.5% user, 0.0% nice, 44.8% system, 2.4% interrupt, 33.3% idle
Mem: 25M Active, 7184K Inact, 2388K Laundry, 54M Wired, 12M Buf, 3496K Free
Swap: 723M Total, 13M Used, 710M Free, 1% Inuse

  PID USERNAME  THR PRI NICE  SIZE    RES   STATE  TIME    WCPU COMMAND
60795 root        1 103    0   22M  8908K  RUN    3:23  100.00% python3.9
 4544 rtestuser   1  20    0   21M  4576K  select 0:00    0.00% sshd
76693 root        1  37    0   15M  2000K  ufs    0:00    0.00% du
76694 root        1  36    0   15M  2000K  ufs    0:00    0.00% du
76662 rtestuser   1  20    0   15M  4700K  ufs    0:00    0.00% du
  758 root        1  20    0   16M  1540K  ttyin  0:00    0.00% csh
 8447 root        1  20    0   18M  6888K  wait   0:00    0.00% python3.9
  764 root        1  20    0   18M  3764K  wait   0:00    0.00% python3.9
  702 root        1  20    0   18M  2172K  select 0:00    0.00% sendmail
76699 root        1  21    0   21M  7924K  select 0:00    0.00% sshd
 4396 root        1  21    0   21M  4452K  select 0:00    0.00% sshd
 4550 rtestuser   1  21    0   13M  2464K  wait   0:00    0.00% sh
  596 root        1  20    0   13M  1412K  select 0:00    0.00% syslogd
76703 rtestuser   1  20    0   13M  2852K  wait   0:00    0.00% sh
....

rtestuser@freeBSD13_2:/data/infra $ ls
createfiles.py  dir2_ren  dir3_ren  du.py  move-move-delete.py
rtestuser@freeBSD13_2:/data/infra $ ls dir3_ren/
dir0  dir1  dir10  dir11  dir12  dir13  dir14  dir15
rtestuser@freeBSD13_2:/data/infra $ ls dir2_ren
^C
^C

System config: 128 MB RAM, 2 CPU cores, 250 GB disk.

Thanks & regards,
Shamsher
Reason for the low RAM: with higher RAM (6-8 GB) we were seeing the same issues, but only after months of runtime. To reproduce quickly, we reduced RAM to 128 MB, which hits the issue in approximately 30 minutes.
Note: no issues seen on FreeBSD 10.4.
So where exactly does the process hang?

procstat kstack <pid>
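If the hang recurs, a small helper can snapshot the kernel stack of every process sleeping on the "ufs" wait channel automatically (a sketch built on ps(1) and procstat(1); the column selection is an assumption about your environment):

#!/usr/bin/env python3
# Sketch: dump the kernel stack of every process currently
# sleeping on the "ufs" wait channel.
import subprocess

ps = subprocess.run(["ps", "-ax", "-o", "pid,state,wchan,comm"],
                    capture_output=True, text=True).stdout
for line in ps.splitlines()[1:]:
    fields = line.split(None, 3)
    if len(fields) == 4 and fields[2] == "ufs":
        pid = fields[0]
        out = subprocess.run(["procstat", "kstack", pid],
                             capture_output=True, text=True).stdout
        print(out)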
root@freeBSD13_2:/data/infra # top
last pid: 3279;  load averages: 2.77, 2.29, 1.33   up 0+00:12:04  12:32:42
22 processes: 4 running, 18 sleeping
CPU: 1.3% user, 0.0% nice, 74.7% system, 0.1% interrupt, 23.9% idle
Mem: 30M Active, 8620K Inact, 1760K Laundry, 48M Wired, 8660K Buf, 2428K Free
Swap: 723M Total, 128K Used, 723M Free

  PID USERNAME  THR PRI NICE  SIZE    RES   STATE  TIME    WCPU COMMAND
  794 root        1  95    0   22M  8692K  RUN    0:43   56.98% python3.9
 3147 root        1 102    0   15M  4548K  RUN    8:18   40.97% du
  786 root        1  20    0   21M  5812K  RUN    0:00    0.00% sshd
  733 root        1  20    0   16M  1788K  pause  0:00    0.00% csh
  789 root        1  20    0   16M  3172K  pause  0:00    0.00% csh
  703 root        1  20    0   18M  4196K  select 0:00    0.00% sendmail
 1640 root        1  21    0   18M  6968K  wait   0:00    0.00% python3.9
  725 root        1  20    0   13M  8192B  wait   0:00    0.00% <login>
  597 root        1  20    0   13M  2028K  select 0:00    0.00% syslogd
  710 root        1  20    0   13M   648K  nanslp 0:00    0.00% cron
  400 root        1  20    0   11M   672K  select 0:00    0.00% devd
  785 root        1  20    0   21M  4644K  select 0:00    0.00% sshd
 3274 root        1  20    0   13M  2076K  ufs    0:00    0.00% ls
  732 root        1  52    0   13M  1612K  ttyin  0:00    0.00% getty
  730 root        1  52    0   13M  1616K  ttyin  0:00    0.00% getty
  729 root        1  52    0   13M  1624K  ttyin  0:00    0.00% getty
  731 root        1  52    0   13M  1616K  ttyin  0:00    0.00% getty
  728 root        1  52    0   13M  1612K  ttyin  0:00    0.00% getty
  726 root        1  52    0   13M  1616K  ttyin  0:00    0.00% getty
  727 root        1  52    0   13M  1616K  ttyin  0:00    0.00% getty
  706 smmsp       1  52    0   18M  8192B  pause  0:00    0.00% <sendmail>
 3279 root        1  20    0   14M  2604K  RUN    0:00    0.00% top

root@freeBSD13_2:/data/infra # ps auxwww | grep 3147
root 3147 48.0 4.7 15368 4548  0 R  12:23 8:20.34 du /
root 3281  0.0 0.3   436  256  0 R+ 12:32 0:00.00 grep 3147
root@freeBSD13_2:/data/infra # procstat kstack 3147
  PID    TID COMM  TDNAME  KSTACK
 3147 100116 du    -       mi_switch intr_event_handle intr_execute_handlers
                           Xapic_isr1 ffs_lock _vn_lock vget_finish
                           vfs_hash_get ffs_vgetf ufs_lookup_ino
                           vfs_cache_lookup VOP_LOOKUP cache_fplookup_noentry
                           cache_fplookup namei kern_statat sys_fstatat
                           amd64_syscall
root@freeBSD13_2:/data/infra # uname -a
FreeBSD freeBSD13_2.cs39 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
root@freeBSD13_2:/data/infra #
The hang is random. Sometimes it hangs at du + (mv/shutil.move/cp), but most of the time it hangs at mv + du. Note: du needs to run as a separate process.
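Since du already has to run as its own process, one option is to launch it via subprocess with a timeout, so a wedged run gets flagged instead of blocking in os.system() forever. A sketch of that variant of du.py (the 300-second limit is my assumption, not from the original scripts):

# Sketch: run du in a separate process and detect a hang via a
# timeout instead of blocking in os.system() indefinitely.
import subprocess
import time

while True:
    p = subprocess.Popen(["du", "/"], stdout=subprocess.DEVNULL)
    try:
        p.wait(timeout=300)      # assumed limit; tune for your setup
    except subprocess.TimeoutExpired:
        print("du (pid %d) looks hung; try: procstat kstack %d" % (p.pid, p.pid))
        break                    # leave the stuck process around for inspection
    time.sleep(10)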
Also, just for info: when replacing du with df, we do not see the hang in any folder. It looks like the combination with df works fine, but the combination with du does not.

$ cat du.py
import os
import time

while 1:
    os.system("du /")   <<-- here replaced du with df
    time.sleep(5)

Note: the subdirectories and files inside dir1 under /data/infra were created as below:

$ cat createfiles.py
import os
import time

i = 0
while i < 50:
    j = 0
    path = "/data/infra/dir1/dir" + str(i)
    os.makedirs(path)
    while j < 10:
        path1 = "/data/infra/dir1/dir" + str(i) + "/file" + str(j)
        # cmd = "touch " + path1
        # os.system(cmd)
        f = open(path1, "w")
        f.write("Hit the issue")
        j += 1
    i += 1

Steps:
1. Create dir: /data/infra
2. Inside the above folder, execute "python createfiles.py"
3. Run "python move-move-delete.py > /dev/null &"
4. Run "python du.py > /dev/null &"

Wait for some time; the issue will get hit in one of the folders.

Sys config/resources: CPU: 1, RAM: 128 MB, Disk: 250 GB
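For what it's worth, the four steps could also be driven from a single script (a sketch assuming the three scripts already sit in /data/infra as described):

#!/usr/bin/env python3
# Sketch: drive the four reproduction steps above from one place.
import os
import subprocess

os.makedirs("/data/infra", exist_ok=True)                   # step 1
os.chdir("/data/infra")

subprocess.run(["python3", "createfiles.py"], check=True)   # step 2
devnull = subprocess.DEVNULL
subprocess.Popen(["python3", "move-move-delete.py"], stdout=devnull)  # step 3
subprocess.Popen(["python3", "du.py"], stdout=devnull)                # step 4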
We have also loaded FreeBSD 13.2 with the ZFS file system. The observed result is the same. Note: for the ZFS system, 192 MB RAM was used.
Seeing the same hang issues when using diskus as well. :(
lock order reversal:
 1st 0xfffff800028677f0 ufs (ufs, lockmgr) @ kern/vfs_mount.c:1742
 2nd 0xfffff8000286dcf0 devfs (devfs, lockmgr) @ kern/vfs_subr.c:3191
lock order devfs -> ufs established at:
#0 0xffffffff80682cbd at witness_checkorder+0x46d
#1 0xffffffff805eacf2 at lockmgr_lock_flags+0x172
#2 0xffffffff808617fd at ffs_lock+0x9d
#3 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#4 0xffffffff80711cd4 at _vn_lock+0x54
#5 0xffffffff806ef8fd at vfs_domount+0xf1d
#6 0xffffffff806edd02 at vfs_donmount+0x872
#7 0xffffffff806f24b7 at kernel_mount+0x57
#8 0xffffffff806f4e61 at parse_mount+0x4a1
#9 0xffffffff806f32e7 at vfs_mountroot+0x587
#10 0xffffffff805a9adf at start_init+0x1f
#11 0xffffffff805d5750 at fork_exit+0x80
#12 0xffffffff80916f1e at fork_trampoline+0xe
lock order ufs -> devfs attempted at:
#0 0xffffffff8068361c at witness_checkorder+0xdcc
#1 0xffffffff805eacf2 at lockmgr_lock_flags+0x172
#2 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#3 0xffffffff80711cd4 at _vn_lock+0x54
#4 0xffffffff806f9ed6 at vput_final+0x116
#5 0xffffffff8085f4ef at ffs_unmount+0x33f
#6 0xffffffff806f09eb at dounmount+0x43b
#7 0xffffffff806f053c at kern_unmount+0x2fc
#8 0xffffffff80943892 at amd64_syscall+0x122
#9 0xffffffff8091683e at fast_syscall_common+0xf8

Can we change the value below? It looks like it has some impact:

debug.vn_lock_pair_pause_max
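If you want to experiment with it, the knob can be read and set via sysctl(8), e.g. from Python (a sketch; setting it requires root and does not persist across reboot):

# Sketch: inspect and set debug.vn_lock_pair_pause_max via sysctl(8).
import subprocess

cur = subprocess.run(["sysctl", "debug.vn_lock_pair_pause_max"],
                     capture_output=True, text=True).stdout
print(cur.strip())
# The value 10 is just the one tried later in this thread.
subprocess.run(["sysctl", "debug.vn_lock_pair_pause_max=10"], check=True)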
Do you have RAG for vn_lock_pair(...) (freebsd-src/sys/kern/vfs_vnops.c)?
Tried with debug.vn_lock_pair_pause_max=10, also with 128 MB RAM. Instead of ~30 minutes, the process hung after ~5 days. :(
db> show lockedvnods
Locked vnodes
vnode 0xfffff800060db780: type VDIR
    usecount 9, writecount 0, refcount 2 seqc users 0 mountedhere 0
    hold count flags ()
    flags (VMP_LAZYLIST)
    v_object 0xfffff80004000c60 ref 0 pages 0 cleanbuf 0 dirtybuf 1
    lock type ufs: SHARED (count 1) with exclusive waiters pending with shared waiters pending
#0 0xffffffff805ead55 at lockmgr_lock_flags+0x1d5
#1 0xffffffff808619fd at ffs_lock+0x9d
#2 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#3 0xffffffff80711cd4 at _vn_lock+0x54
#4 0xffffffff806f97aa at vget_finish+0x7a
#5 0xffffffff806e8d7f at vfs_hash_get+0xbf
#6 0xffffffff8085cb88 at ffs_vgetf+0x58
#7 0xffffffff808697fa at ufs_lookup_ino+0x102a
#8 0xffffffff809bbb66 at VOP_CACHEDLOOKUP_APV+0x56
#9 0xffffffff806da636 at vfs_cache_lookup+0xa6
#10 0xffffffff809bb9b6 at VOP_LOOKUP_APV+0x56
#11 0xffffffff806eb431 at lookup+0x481
#12 0xffffffff806ea786 at namei+0x4d6
#13 0xffffffff8070d7f9 at kern_renameat+0x109
#14 0xffffffff80943eba at amd64_syscall+0x74a
#15 0xffffffff80916b1e at fast_syscall_common+0xf8
    nlink=52, effnlink=10, size=1024, extsize 0
    generation=d605dcda, uid=0, gid=0, flags=0x0
    ino 1218572, on dev da0p9

vnode 0xfffff80002f00780: type VDIR
    usecount 0, writecount 0, refcount 2 seqc users 1 mountedhere 0
    hold count flags ()
    flags (VIRF_DOOMED)
    v_object 0xfffff80003ff4a50 ref 0 pages 0 cleanbuf 0 dirtybuf 0
    lock type ufs: EXCL by thread 0xfffffe0057960800 (pid 1568, mv, tid 100157)
#0 0xffffffff805ead55 at lockmgr_lock_flags+0x1d5
#1 0xffffffff808619fd at ffs_lock+0x9d
#2 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#3 0xffffffff80711cd4 at _vn_lock+0x54
#4 0xffffffff806f97aa at vget_finish+0x7a
#5 0xffffffff806e8d7f at vfs_hash_get+0xbf
#6 0xffffffff8085cb88 at ffs_vgetf+0x58
#7 0xffffffff80869a9c at ufs_lookup_ino+0x12cc
#8 0xffffffff809bbb66 at VOP_CACHEDLOOKUP_APV+0x56
#9 0xffffffff806da636 at vfs_cache_lookup+0xa6
#10 0xffffffff809bb9b6 at VOP_LOOKUP_APV+0x56
#11 0xffffffff806eb431 at lookup+0x481
#12 0xffffffff806ea786 at namei+0x4d6
#13 0xffffffff8070d7f9 at kern_renameat+0x109
#14 0xffffffff80943eba at amd64_syscall+0x74a
#15 0xffffffff80916b1e at fast_syscall_common+0xf8
    nlink=3, effnlink=2, size=512, extsize 0
    generation=77c90b5e, uid=0, gid=0, flags=0x0
    ino 1219839, on dev da0p9
db>
last pid: 81732;  load averages: 2.86, 2.92, 2.89   up 0+01:51:23  06:29:09
42 processes: 3 running, 37 sleeping, 2 zombie
CPU: 0.9% user, 0.0% nice, 94.1% system, 0.7% interrupt, 4.3% idle
Mem: 27M Active, 2692K Inact, 5164K Laundry, 61M Wired, 20M Buf, 2704K Free
Swap: 6144M Total, 164M Used, 5980M Free, 2% Inuse

  PID USERNAME  THR PRI NICE  SIZE    RES   STATE  TIME     WCPU COMMAND
 1568 root        1 103    0   21M   192K  RUN    85:33  100.00% mv
 1009 root        1  24    0   23M     0B  wait    0:13    0.00% <python>
69488 root        6  24    0  824M    12M  uwait   0:06    0.00% vault
87766 root        3  20    0   29M  2800K  select  0:04    0.00% vmtoolsd
98489 root        1  20    0   19M  1596K  select  0:02    0.00% htop
83894 root        1  20    0   72M  2180K  select  0:02    0.00% raid_log_wat
16158 root        1  20    0   81M   304K  piperd  0:02    0.00% external_aut
52144 root        1  52    0   80M    24M  RUN     0:01    0.00% core_watch
90087 root        1  22    0   20M   188K  ufs     0:01    0.00% du
90718 root        1  21    0   16M   188K  ufs     0:01    0.00% du
   21 root        1  22    0   26M   192K  ufs     0:01    0.00% du
99342 root        1  22    0   23M   192K  ufs     0:01    0.00% du
99975 root        1  22    0   20M   192K  ufs     0:01    0.00% du
99885 root        1  22    0   27M   192K  ufs     0:01    0.00% du
  425 root        1  22    0   22M   192K  ufs     0:01    0.00% du
97421 pgsql       1  20    0  187M   380K  select  0:00    0.00% postgres