Bug 272434 - Hangs with combined use of du and disk operations (like mv,cp,shutil.move/rmtree,os.rename)
Summary: Hangs with combined use of du and disk operations (like mv,cp,shutil.move/rmtree,os.rename)
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.2-RELEASE
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
URL:
Keywords: needs-qa
Depends on:
Blocks:
 
Reported: 2023-07-09 12:44 UTC by shamsher
Modified: 2023-09-26 07:17 UTC
CC: 1 user

See Also:


Attachments

Description shamsher 2023-07-09 12:44:04 UTC
Hi,
We were performing file system operations using du, cp, mv, shutil.move/rmtree and os.rename. We observed that operations on one of the directories hang on 13.1 and 13.2 (also checked the 13.0 and 12.4 ISOs; the same issue was observed). Yet to verify on CURRENT.

rtestuser@freeBSD13_2:~ $ uname -a
FreeBSD freeBSD13_2.cs39 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
rtestuser@freeBSD13_2:~ $ cd /data/infra/
rtestuser@freeBSD13_2:/data/infra $ ps -aux | grep python
root        726   0.0  9.1 23100 8768 v0  I    14:34   14:51.95 python move-move-delete.py (python3.9)
root       3332   0.0  7.3 18800 7028 v0  I    14:35    0:00.03 python du.py (python3.9)
root      23350   0.0  7.3 18800 7028 v0  I    15:07    0:00.02 python du.py (python3.9)
rtestuser 48818   0.0  2.1 12812 1984  0  S+   15:53    0:00.00 grep python
rtestuser@freeBSD13_2:/data/infra $ ls
createfiles.py		dir1			dir4_ren		du.py			move-move-delete.py
rtestuser@freeBSD13_2:/data/infra $ cat move-move-delete.py 
import os
import shutil
import time
path2 = "/data/infra/dir2_ren"
path3 = "/data/infra/dir3_ren"
path4 = "/data/infra/dir4_ren"
while 1:
    os.rename("/data/infra/dir1/","/data/infra/dir2_ren/")
    os.system("ls -ir /data/infra/")
    print("rename done")
    os.mkdir(path3)
    files = [f for f in os.listdir("/data/infra/dir2_ren/")]
    for file in files:
        old = "/data/infra/dir2_ren/" + file
        ipath = "/data/infra/dir2_ren/" + file + "/"
        files1 = [af for af in os.listdir(ipath)]
        for ff in files1:
            fpath = "/data/infra/dir2_ren/" + file + "/" + ff   
            # keeping files opened to check them further
            p = open(fpath,"w")
            p1 = open(fpath,"w")
            print("I AM WRITING")
            p.write("I am writing now!")
        new = "/data/infra/dir3_ren/" + file
        shutil.move(old,new)  
    print("shutil.move done")
    shutil.rmtree("/data/infra/dir2_ren")
    print("deleted dir2_ren")
    os.system("ls -ir /data/infra/")
    os.mkdir(path4)
    os.system("mv /data/infra/dir3_ren/* /data/infra/dir4_ren/")     
    print("os mv command success")
    shutil.rmtree("/data/infra/dir3_ren")
    print("deleted dir3_ren")
    print("after dir3_ren")
    os.system("ls -ir /data/infra/")
    os.system("mkdir /data/infra/dir1")
    os.system("cp -rf /data/infra/dir4_ren/* /data/infra/dir1/")     
    print("dir1 created back and copied data from dir4 to dir1")
    shutil.rmtree("/data/infra/dir4_ren")
rtestuser@freeBSD13_2:/data/infra $ cat du.py 
import os
import time
while 1:
    os.system("du /")
    time.sleep(10)
rtestuser@freeBSD13_2:/data/infra $ ls -lrt dir1/
total 0
rtestuser@freeBSD13_2:/data/infra $ ls -lrt dir4_ren/


^C


Note: in the next run it got stuck in shutil.move.

rtestuser@freeBSD13_2:~ $ uname -r
13.2-RELEASE
rtestuser@freeBSD13_2:~ $ uname -a
FreeBSD freeBSD13_2.cs39 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
rtestuser@freeBSD13_2:~ $ top


last pid: 76719;  load averages:  2.02,  1.48,  1.04                                                                                          up 0+00:27:10  18:00:02
28 processes:  3 running, 25 sleeping
CPU: 19.5% user,  0.0% nice, 44.8% system,  2.4% interrupt, 33.3% idle
Mem: 25M Active, 7184K Inact, 2388K Laundry, 54M Wired, 12M Buf, 3496K Free
Swap: 723M Total, 13M Used, 710M Free, 1% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
60795 root          1 103    0    22M  8908K RUN      3:23 100.00% python3.9
 4544 rtestuser     1  20    0    21M  4576K select   0:00   0.00% sshd
76693 root          1  37    0    15M  2000K ufs      0:00   0.00% du
76694 root          1  36    0    15M  2000K ufs      0:00   0.00% du
76662 rtestuser     1  20    0    15M  4700K ufs      0:00   0.00% du
  758 root          1  20    0    16M  1540K ttyin    0:00   0.00% csh
 8447 root          1  20    0    18M  6888K wait     0:00   0.00% python3.9
  764 root          1  20    0    18M  3764K wait     0:00   0.00% python3.9
  702 root          1  20    0    18M  2172K select   0:00   0.00% sendmail
76699 root          1  21    0    21M  7924K select   0:00   0.00% sshd
 4396 root          1  21    0    21M  4452K select   0:00   0.00% sshd
 4550 rtestuser     1  21    0    13M  2464K wait     0:00   0.00% sh
  596 root          1  20    0    13M  1412K select   0:00   0.00% syslogd
76703 rtestuser     1  20    0    13M  2852K wait     0:00   0.00% sh
....

rtestuser@freeBSD13_2:/data/infra $ ls 
createfiles.py		dir2_ren		dir3_ren		du.py			move-move-delete.py
rtestuser@freeBSD13_2:/data/infra $ ls dir3_ren/
dir0	dir1	dir10	dir11	dir12	dir13	dir14	dir15
rtestuser@freeBSD13_2:/data/infra $ ls dir2_ren


^C
^C

System config: 128 MB RAM, 2 CPU cores, 250 GB disk.
Thanks & regards,
Shamsher
Comment 1 shamsher 2023-07-09 12:46:25 UTC
Reason for the low RAM: with higher RAM (6/8 GB) we were seeing the same issues, but only over months. To reproduce quickly, we reduced RAM to 128 MB, which hits the issue in approximately 30 minutes.
Comment 2 shamsher 2023-07-09 12:49:14 UTC
Note: no issues were seen in FreeBSD 10.4.
Comment 3 Yuri Pankov freebsd_committer freebsd_triage 2023-07-09 13:34:30 UTC
So where exactly does the process hang?

procstat kstack <pid>
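
For example, to dump the kernel stacks of all the stuck du processes at once (a sketch; procstat accepts a list of pids, and pgrep is in the base system):

procstat kstack $(pgrep du)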
Comment 4 shamsher 2023-07-10 07:03:44 UTC
root@freeBSD13_2:/data/infra # top

last pid:  3279;  load averages:  2.77,  2.29,  1.33                                                                                          up 0+00:12:04  12:32:42
22 processes:  4 running, 18 sleeping
CPU:  1.3% user,  0.0% nice, 74.7% system,  0.1% interrupt, 23.9% idle
Mem: 30M Active, 8620K Inact, 1760K Laundry, 48M Wired, 8660K Buf, 2428K Free
Swap: 723M Total, 128K Used, 723M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
  794 root          1  95    0    22M  8692K RUN      0:43  56.98% python3.9
 3147 root          1 102    0    15M  4548K RUN      8:18  40.97% du
  786 root          1  20    0    21M  5812K RUN      0:00   0.00% sshd
  733 root          1  20    0    16M  1788K pause    0:00   0.00% csh
  789 root          1  20    0    16M  3172K pause    0:00   0.00% csh
  703 root          1  20    0    18M  4196K select   0:00   0.00% sendmail
 1640 root          1  21    0    18M  6968K wait     0:00   0.00% python3.9
  725 root          1  20    0    13M  8192B wait     0:00   0.00% <login>
  597 root          1  20    0    13M  2028K select   0:00   0.00% syslogd
  710 root          1  20    0    13M   648K nanslp   0:00   0.00% cron
  400 root          1  20    0    11M   672K select   0:00   0.00% devd
  785 root          1  20    0    21M  4644K select   0:00   0.00% sshd
 3274 root          1  20    0    13M  2076K ufs      0:00   0.00% ls
  732 root          1  52    0    13M  1612K ttyin    0:00   0.00% getty
  730 root          1  52    0    13M  1616K ttyin    0:00   0.00% getty
  729 root          1  52    0    13M  1624K ttyin    0:00   0.00% getty
  731 root          1  52    0    13M  1616K ttyin    0:00   0.00% getty
  728 root          1  52    0    13M  1612K ttyin    0:00   0.00% getty
  726 root          1  52    0    13M  1616K ttyin    0:00   0.00% getty
  727 root          1  52    0    13M  1616K ttyin    0:00   0.00% getty
  706 smmsp         1  52    0    18M  8192B pause    0:00   0.00% <sendmail>
 3279 root          1  20    0    14M  2604K RUN      0:00   0.00% top



root@freeBSD13_2:/data/infra # ps auxwww | grep 3147
root  3147 48.0  4.7 15368 4548  0  R    12:23   8:20.34 du /
root  3281  0.0  0.3   436  256  0  R+   12:32   0:00.00 grep 3147
root@freeBSD13_2:/data/infra # procstat kstack 3147
  PID    TID COMM                TDNAME              KSTACK                       
 3147 100116 du                  -                   mi_switch intr_event_handle intr_execute_handlers Xapic_isr1 ffs_lock _vn_lock vget_finish vfs_hash_get ffs_vgetf ufs_lookup_ino vfs_cache_lookup VOP_LOOKUP cache_fplookup_noentry cache_fplookup namei kern_statat sys_fstatat amd64_syscall 
root@freeBSD13_2:/data/infra # uname -a
FreeBSD freeBSD13_2.cs39 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
root@freeBSD13_2:/data/infra #
Comment 5 shamsher 2023-07-10 07:15:16 UTC
The hang is random.
Sometimes it hangs at du + (mv/shutil.move/cp), but most of the time it hangs at mv + du.
Note: du needs to run as a separate process.
Comment 6 shamsher 2023-07-10 12:16:42 UTC
Also, for information: when du is replaced with df, no hang is seen in any folder.
It looks like the df combination works fine, but the du combination does not.

$ cat du.py 
import os
import time
while 1:
    os.system("du /")     <<-- here replaced du with df
    time.sleep(5)

Note: dir1, with subdirectories and files, was created inside the /data/infra folder as below:
$ cat createfiles.py 
import os
import time
i = 0
while i < 50:
    j = 0
    path = "/data/infra/dir1/dir" + str(i)
    os.makedirs(path)
    while j < 10:
        path1 = "/data/infra/dir1/dir" + str(i) + "/file" + str(j)
#        cmd = "touch " + path1
#        os.system(cmd)
        f = open(path1,"w")
        f.write("Hit the issue")
        f.close()
        j += 1
    i += 1

Steps: 1- create dir: /data/infra
       2- inside the above folder, execute "python createfiles.py"
       3- run "python move-move-delete.py >/dev/null &"
       4- run "python du.py >/dev/null &"
Wait for some time; the issue will be hit in one of the folders (see the wrapper sketch below).
Sys config/resources: cpu: 1, RAM: 128 MB, disk: 250 GB
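
A minimal wrapper for the steps above (a sketch; it assumes the three Python scripts from this report are already saved under /data/infra):

#!/bin/sh
# repro.sh - sketch of the reproduction steps above
mkdir -p /data/infra                        # step 1: create the test directory
cd /data/infra
python createfiles.py                       # step 2: populate dir1 with subdirs/files
python move-move-delete.py >/dev/null &     # step 3: rename/move/delete loop
python du.py >/dev/null &                   # step 4: concurrent du scans
# With 128 MB RAM the hang typically appears within ~30 minutes.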
Comment 7 shamsher 2023-09-02 06:47:26 UTC
We have loaded FreeBSD13.2 with ZFS file system also.
Observed Result is same.
Note: For ZFS system, RAM used as 192MB.
Comment 8 shamsher 2023-09-04 10:42:25 UTC
The same hang is seen when using diskus as well. :(
Comment 9 shamsher 2023-09-16 13:36:01 UTC
lock order reversal:
 1st 0xfffff800028677f0 ufs (ufs, lockmgr) @ kern/vfs_mount.c:1742
 2nd 0xfffff8000286dcf0 devfs (devfs, lockmgr) @ kern/vfs_subr.c:3191
lock order devfs -> ufs established at:
#0 0xffffffff80682cbd at witness_checkorder+0x46d
#1 0xffffffff805eacf2 at lockmgr_lock_flags+0x172
#2 0xffffffff808617fd at ffs_lock+0x9d
#3 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#4 0xffffffff80711cd4 at _vn_lock+0x54
#5 0xffffffff806ef8fd at vfs_domount+0xf1d
#6 0xffffffff806edd02 at vfs_donmount+0x872
#7 0xffffffff806f24b7 at kernel_mount+0x57
#8 0xffffffff806f4e61 at parse_mount+0x4a1
#9 0xffffffff806f32e7 at vfs_mountroot+0x587
#10 0xffffffff805a9adf at start_init+0x1f
#11 0xffffffff805d5750 at fork_exit+0x80
#12 0xffffffff80916f1e at fork_trampoline+0xe
lock order ufs -> devfs attempted at:
#0 0xffffffff8068361c at witness_checkorder+0xdcc
#1 0xffffffff805eacf2 at lockmgr_lock_flags+0x172
#2 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#3 0xffffffff80711cd4 at _vn_lock+0x54
#4 0xffffffff806f9ed6 at vput_final+0x116
#5 0xffffffff8085f4ef at ffs_unmount+0x33f
#6 0xffffffff806f09eb at dounmount+0x43b
#7 0xffffffff806f053c at kern_unmount+0x2fc
#8 0xffffffff80943892 at amd64_syscall+0x122
#9 0xffffffff8091683e at fast_syscall_common+0xf8

Can we change the value below? It looks like it has some impact:
debug.vn_lock_pair_pause_max
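
For reference, the knob can be read and changed with sysctl; the value 10 below is only an example:

sysctl debug.vn_lock_pair_pause_max                            # read the current value
sysctl debug.vn_lock_pair_pause_max=10                         # change it at runtime
echo 'debug.vn_lock_pair_pause_max=10' >> /etc/sysctl.conf     # persist across reboots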
Comment 10 shamsher 2023-09-19 09:27:43 UTC
Do you have RAG for vn_lock_pair(...) (freebsd-src/sys/kern/vfs_vnops.c)?
Comment 11 shamsher 2023-09-24 13:48:02 UTC
Tried with debug.vn_lock_pair_pause_max=10, also with 128 MB. Instead of ~30 min, the process hung after ~5 days :(
Comment 12 shamsher 2023-09-26 07:15:01 UTC
db> show lockedvnods
Locked vnodes
vnode 0xfffff800060db780: type VDIR
    usecount 9, writecount 0, refcount 2 seqc users 0 mountedhere 0
    hold count flags ()
    flags (VMP_LAZYLIST)
    v_object 0xfffff80004000c60 ref 0 pages 0 cleanbuf 0 dirtybuf 1
    lock type ufs: SHARED (count 1)
 with exclusive waiters pending
 with shared waiters pending
#0 0xffffffff805ead55 at lockmgr_lock_flags+0x1d5
#1 0xffffffff808619fd at ffs_lock+0x9d
#2 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#3 0xffffffff80711cd4 at _vn_lock+0x54
#4 0xffffffff806f97aa at vget_finish+0x7a
#5 0xffffffff806e8d7f at vfs_hash_get+0xbf
#6 0xffffffff8085cb88 at ffs_vgetf+0x58
#7 0xffffffff808697fa at ufs_lookup_ino+0x102a
#8 0xffffffff809bbb66 at VOP_CACHEDLOOKUP_APV+0x56
#9 0xffffffff806da636 at vfs_cache_lookup+0xa6
#10 0xffffffff809bb9b6 at VOP_LOOKUP_APV+0x56
#11 0xffffffff806eb431 at lookup+0x481
#12 0xffffffff806ea786 at namei+0x4d6
#13 0xffffffff8070d7f9 at kern_renameat+0x109
#14 0xffffffff80943eba at amd64_syscall+0x74a
#15 0xffffffff80916b1e at fast_syscall_common+0xf8
	nlink=52, effnlink=10, size=1024, extsize 0
	generation=d605dcda, uid=0, gid=0, flags=0x0
	ino 1218572, on dev da0p9
vnode 0xfffff80002f00780: type VDIR
    usecount 0, writecount 0, refcount 2 seqc users 1 mountedhere 0
    hold count flags ()
    flags (VIRF_DOOMED)
    v_object 0xfffff80003ff4a50 ref 0 pages 0 cleanbuf 0 dirtybuf 0
    lock type ufs: EXCL by thread 0xfffffe0057960800 (pid 1568, mv, tid 100157)
#0 0xffffffff805ead55 at lockmgr_lock_flags+0x1d5
#1 0xffffffff808619fd at ffs_lock+0x9d
#2 0xffffffff809bf090 at VOP_LOCK1_APV+0x40
#3 0xffffffff80711cd4 at _vn_lock+0x54
#4 0xffffffff806f97aa at vget_finish+0x7a
#5 0xffffffff806e8d7f at vfs_hash_get+0xbf
#6 0xffffffff8085cb88 at ffs_vgetf+0x58
#7 0xffffffff80869a9c at ufs_lookup_ino+0x12cc
#8 0xffffffff809bbb66 at VOP_CACHEDLOOKUP_APV+0x56
#9 0xffffffff806da636 at vfs_cache_lookup+0xa6
#10 0xffffffff809bb9b6 at VOP_LOOKUP_APV+0x56
#11 0xffffffff806eb431 at lookup+0x481
#12 0xffffffff806ea786 at namei+0x4d6
#13 0xffffffff8070d7f9 at kern_renameat+0x109
#14 0xffffffff80943eba at amd64_syscall+0x74a
#15 0xffffffff80916b1e at fast_syscall_common+0xf8
	nlink=3, effnlink=2, size=512, extsize 0
	generation=77c90b5e, uid=0, gid=0, flags=0x0
	ino 1219839, on dev da0p9
db>
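
The output above was captured at the DDB prompt. A sketch of how to get there from the console (this assumes a kernel with the DDB option compiled in):

sysctl debug.kdb.enter=1     # break into the in-kernel debugger
db> show lockedvnods         # list locked vnodes, as above
db> c                        # continue normal execution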
Comment 13 shamsher 2023-09-26 07:17:05 UTC
last pid: 81732;  load averages:  2.86,  2.92,  2.89    up 0+01:51:23  06:29:09
42 processes:  3 running, 37 sleeping, 2 zombie
CPU:  0.9% user,  0.0% nice, 94.1% system,  0.7% interrupt,  4.3% idle
Mem: 27M Active, 2692K Inact, 5164K Laundry, 61M Wired, 20M Buf, 2704K Free
Swap: 6144M Total, 164M Used, 5980M Free, 2% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
 1568 root          1 103    0    21M   192K RUN     85:33 100.00% mv
 1009 root          1  24    0    23M     0B wait     0:13   0.00% <python>
69488 root          6  24    0   824M    12M uwait    0:06   0.00% vault
87766 root          3  20    0    29M  2800K select   0:04   0.00% vmtoolsd
98489 root          1  20    0    19M  1596K select   0:02   0.00% htop
83894 root          1  20    0    72M  2180K select   0:02   0.00% raid_log_wat
16158 root          1  20    0    81M   304K piperd   0:02   0.00% external_aut
52144 root          1  52    0    80M    24M RUN      0:01   0.00% core_watch
90087 root          1  22    0    20M   188K ufs      0:01   0.00% du
90718 root          1  21    0    16M   188K ufs      0:01   0.00% du
   21 root          1  22    0    26M   192K ufs      0:01   0.00% du
99342 root          1  22    0    23M   192K ufs      0:01   0.00% du
99975 root          1  22    0    20M   192K ufs      0:01   0.00% du
99885 root          1  22    0    27M   192K ufs      0:01   0.00% du
  425 root          1  22    0    22M   192K ufs      0:01   0.00% du
97421 pgsql         1  20    0   187M   380K select   0:00   0.00% postgres