Bug 200447

Summary: Multiple time-related tests are having issues with Jenkins/Bhyve in the past few months; Jenkins/Bhyve on the cluster is non-performant/less deterministic compared to VMware Fusion 7
Product: Services Reporter: Enji Cooper <ngie>
Component: Testing & CIAssignee: Li-Wen Hsu <lwhsu>
Status: Closed FIXED    
Severity: Affects Many People CC: allanjude, emaste, grehan, mav, neel, testing
Priority: ---    
Version: unspecified   
Hardware: Any   
OS: Any   

Description Enji Cooper freebsd_committer freebsd_triage 2015-05-25 20:53:16 UTC
I originally raised this as an issue with mav@ after a few commits made to bhyve seemed to make the test results less reliable: https://lists.freebsd.org/pipermail/freebsd-testing/2015-April/000943.html .

Going through a number of the recent results, a lot of the issues that keep on popping up are related to time. Some examples are:
- https://jenkins.freebsd.org/job/FreeBSD_HEAD-tests/1058/testReport/junit/lib.libc.gen/sleep_test/kevent/
- https://jenkins.freebsd.org/job/FreeBSD_HEAD-tests/1056/testReport/junit/lib.libc.sys/sigtimedwait_test/sigtimedwait_NULL_timeout/

VMware Fusion 7 with open-vm-tools installed on my Macbook isn't even remotely close execution time
wise to the examples above, so my guess is something's really off with how BHyve's configured on the
Jenkins slaves/cluster, the BHyve host is really CURRENT with WITNESS, the host is overtaxed, something
else is setup incorrectly, or performance in general with BHyve is sucking :(...

$ (cd /usr/tests/lib/libc/gen/; time kyua test sleep_test:kevent)                                                                                                     
sleep_test:kevent  ->  passed  [27.218s]

Results file id is usr_tests_lib_libc_gen.20150525-204448-521721
Results saved to /home/ngie/.kyua/store/results.usr_tests_lib_libc_gen.20150525-204448-521721.db

1/1 passed (0 failed)

real    0m27.854s
user    0m0.025s
sys     0m0.346s

$ (cd /usr/tests/lib/libc/sys/; time kyua test sigtimedwait_test:sigtimedwait_NULL_timeout)
sigtimedwait_test:sigtimedwait_NULL_timeout  ->  passed  [5.011s]

Results file id is usr_tests_lib_libc_sys.20150525-205015-100358
Results saved to /home/ngie/.kyua/store/results.usr_tests_lib_libc_sys.20150525-205015-100358.db

1/1 passed (0 failed)

real    0m5.195s
user    0m0.016s
sys     0m0.017s
Comment 1 Enji Cooper freebsd_committer freebsd_triage 2015-05-25 21:05:19 UTC
Backing storage is on spinning rust. I'm running OSX Mavericks 10.9.5. The machine at the time when those numbers were gathered in comment # 0 is mostly idle-ish.

$ grep Intel /var/run/dmesg.boot  | head -1
CPU: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz (2000.07-MHz K8-class CPU)
$ sysctl -n hw.physmem                                                                                                                                                   
4267769856
$ sudo camcontrol devlist
<NECVMWar VMware IDE CDR10 1.00>   at scbus1 target 0 lun 0 (cd0,pass0)
<VMware, VMware Virtual S 1.0>     at scbus2 target 0 lun 0 (pass1,da0)
$ grep ^da0 /var/run/dmesg.boot 
da0 at mpt0 bus 0 scbus2 target 0 lun 0
da0: <VMware, VMware Virtual S 1.0> Fixed Direct Access SCSI-2 device
da0: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
da0: Command Queueing enabled
da0: 40960MB (83886080 512 byte sectors: 255H 63S/T 5221C)
Comment 2 Enji Cooper freebsd_committer freebsd_triage 2015-06-16 04:50:49 UTC
Hmmm... this looks a bit unsettling. From
https://jenkins.freebsd.org/job/FreeBSD_HEAD-tests/1108/consoleFull :

--- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x800897c0a, rsp = 0x7fffffffb958, rbp = 0x7fffffffeac0 ---
passed  [3.183s]
usr.sbin/etcupdate/always_test:main  ->  ahcich0: Timeout on slot 5 port 0
ahcich0: is 00000000 cs 00000000 ss 00000060 rs 00000060 tfd 50 serr 00000000 cmd 1000c617
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 58 c8 27 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich0: Timeout on slot 16 port 0
ahcich0: is 00000000 cs 00000000 ss 00010000 rs 00010000 tfd 50 serr 00000000 cmd 1000d017
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 a8 cd 27 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
nfs server jenkins-10.freebsd.org:/builds: not responding
nfs server jenkins-10.freebsd.org:/builds: is alive again
broken: Test case timed out  [645.362s]
usr.sbin/etcupdate/conflicts_test:main  ->  ahcich0: Timeout on slot 12 port 0
ahcich0: is 00000000 cs 00000000 ss fffff001 rs fffff001 tfd 50 serr 00000000 cmd 1000c017
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 38 6b 20 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
passed  [203.671s]
usr.sbin/etcupdate/fbsdid_test:main  ->  passed  [24.817s]
usr.sbin/etcupdate/ignore_test:main  ->  ahcich0: Timeout on slot 1 port 0
ahcich0: is 00000000 cs 00000000 ss 00000002 rs 00000002 tfd 50 serr 00000000 cmd 1000c117
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 78 4c 20 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
passed  [67.351s]
usr.sbin/etcupdate/preworld_test:main  ->  passed  [10.769s]
usr.sbin/etcupdate/tests_test:main  ->  ahcich0: Timeout on slot 16 port 0
ahcich0: is 00000000 cs 00000000 ss 00070000 rs 00070000 tfd 50 serr 00000000 cmd 1000d217
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 b0 c2 27 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
ahcich0: Timeout on slot 2 port 0
ahcich0: is 00000000 cs 00000000 ss 00000004 rs 00000004 tfd 50 serr 00000000 cmd 1000c217
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 d0 d7 21 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: Command timeout
(ada0:ahcich0:0:0:0): Retrying command
broken: Test case timed out  [313.478s]
Comment 3 Enji Cooper freebsd_committer freebsd_triage 2015-06-16 05:21:00 UTC
FWIW I'm more convinced at this point that this is being caused by something being served over NFS instead of a local disk/array. Craig, et al: could you please tell me how things are currently mounted/loaded on the Jenkins cluster (mountpoints, how things are backed, how things are written out to NFS, etc)?
Comment 4 Allan Jude freebsd_committer freebsd_triage 2015-06-16 05:53:42 UTC
(In reply to Garrett Cooper,425-314-3911 from comment #2)
That output suggests a failing drive, which would cause all kinds of variance in those tests
Comment 5 Enji Cooper freebsd_committer freebsd_triage 2015-06-16 06:19:01 UTC
(In reply to Allan Jude from comment #4)

We run into these issues all the time when our NFS stores go away for VMware ESXi hosted VMs at $work. That's why I was curious about how things are backed when running the Jenkins jobs.

If things are truly hosted over NFS, that's horrible for performance (especially with NFS v4 when using TCP, which IIRC was the default); the test host needs to use a local disk -- preferably a RAID-1 SSD configuration using either graid or ZFS. The Jenkins job/infrastructure should not test out NFS latency/performance :)...
Comment 6 Enji Cooper freebsd_committer freebsd_triage 2015-07-04 08:10:20 UTC
The issue has been persisting pretty regularly the past 20-30 runs. Example:
https://jenkins.freebsd.org/job/FreeBSD_HEAD-tests/1160/console

The current implementation is a maintenance burden. I'm close to just rewriting the code to use vmware workstation or ESXi at this point instead of bhyve and host my own setup. The jenkins failure emails have been excessive lately -- I'm about ready to kill file all of them.
Comment 7 Peter Grehan freebsd_committer freebsd_triage 2015-07-04 16:07:33 UTC
Who's in charge of this setup ? No-one answered the question about NFS or what the configuration is.

If the system is running without any admin, it seems prudent to at least turn off the emails until the issue an be diagnosed.
Comment 8 Enji Cooper freebsd_committer freebsd_triage 2017-05-08 17:19:10 UTC
ci.freebsd.org has fixed this issue.