Summary: | Multiple time-related tests are having issues with Jenkins/Bhyve in the past few months; Jenkins/Bhyve on the cluster is non-performant/less deterministic compared to VMware Fusion 7 | ||
---|---|---|---|
Product: | Services | Reporter: | Enji Cooper <ngie> |
Component: | Testing & CI | Assignee: | Li-Wen Hsu <lwhsu> |
Status: | Closed FIXED | ||
Severity: | Affects Many People | CC: | allanjude, emaste, grehan, mav, neel, testing |
Priority: | --- | ||
Version: | unspecified | ||
Hardware: | Any | ||
OS: | Any | ||
Description
Enji Cooper
2015-05-25 20:53:16 UTC
Backing storage is on spinning rust. I'm running OS X Mavericks 10.9.5. The machine was mostly idle when the numbers in comment #0 were gathered.

    $ grep Intel /var/run/dmesg.boot | head -1
    CPU: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz (2000.07-MHz K8-class CPU)
    $ sysctl -n hw.physmem
    4267769856
    $ sudo camcontrol devlist
    <NECVMWar VMware IDE CDR10 1.00> at scbus1 target 0 lun 0 (cd0,pass0)
    <VMware, VMware Virtual S 1.0> at scbus2 target 0 lun 0 (pass1,da0)
    $ grep ^da0 /var/run/dmesg.boot
    da0 at mpt0 bus 0 scbus2 target 0 lun 0
    da0: <VMware, VMware Virtual S 1.0> Fixed Direct Access SCSI-2 device
    da0: 320.000MB/s transfers (160.000MHz, offset 127, 16bit)
    da0: Command Queueing enabled
    da0: 40960MB (83886080 512 byte sectors: 255H 63S/T 5221C)

Hmmm... this looks a bit unsettling. From https://jenkins.freebsd.org/job/FreeBSD_HEAD-tests/1108/consoleFull :

    --- syscall (128, FreeBSD ELF64, sys_rename), rip = 0x800897c0a, rsp = 0x7fffffffb958, rbp = 0x7fffffffeac0 ---
    passed [3.183s]
    usr.sbin/etcupdate/always_test:main -> ahcich0: Timeout on slot 5 port 0
    ahcich0: is 00000000 cs 00000000 ss 00000060 rs 00000060 tfd 50 serr 00000000 cmd 1000c617
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 58 c8 27 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    ahcich0: Timeout on slot 16 port 0
    ahcich0: is 00000000 cs 00000000 ss 00010000 rs 00010000 tfd 50 serr 00000000 cmd 1000d017
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 a8 cd 27 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    nfs server jenkins-10.freebsd.org:/builds: not responding
    nfs server jenkins-10.freebsd.org:/builds: is alive again
    broken: Test case timed out [645.362s]
    usr.sbin/etcupdate/conflicts_test:main -> ahcich0: Timeout on slot 12 port 0
    ahcich0: is 00000000 cs 00000000 ss fffff001 rs fffff001 tfd 50 serr 00000000 cmd 1000c017
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 38 6b 20 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    passed [203.671s]
    usr.sbin/etcupdate/fbsdid_test:main -> passed [24.817s]
    usr.sbin/etcupdate/ignore_test:main -> ahcich0: Timeout on slot 1 port 0
    ahcich0: is 00000000 cs 00000000 ss 00000002 rs 00000002 tfd 50 serr 00000000 cmd 1000c117
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 78 4c 20 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    passed [67.351s]
    usr.sbin/etcupdate/preworld_test:main -> passed [10.769s]
    usr.sbin/etcupdate/tests_test:main -> ahcich0: Timeout on slot 16 port 0
    ahcich0: is 00000000 cs 00000000 ss 00070000 rs 00070000 tfd 50 serr 00000000 cmd 1000d217
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 b0 c2 27 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    ahcich0: Timeout on slot 2 port 0
    ahcich0: is 00000000 cs 00000000 ss 00000004 rs 00000004 tfd 50 serr 00000000 cmd 1000c217
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 d0 d7 21 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    broken: Test case timed out [313.478s]

FWIW I'm more convinced at this point that this is being caused by something being served over NFS instead of a local disk/array.
Craig, et al: could you please tell me how things are currently mounted/loaded on the Jenkins cluster (mountpoints, how things are backed, how things are written out to NFS, etc)?

(In reply to Garrett Cooper from comment #2)
That output suggests a failing drive, which would cause all kinds of variance in those tests.

(In reply to Allan Jude from comment #4)
We run into these issues all the time when our NFS stores go away for VMware ESXi hosted VMs at $work. That's why I was curious about how things are backed when running the Jenkins jobs. If things are truly hosted over NFS, that's horrible for performance (especially with NFS v4 when using TCP, which IIRC was the default); the test host needs to use a local disk -- preferably a RAID-1 SSD configuration using either graid or ZFS. The Jenkins job/infrastructure should not test out NFS latency/performance :)...

The issue has been persisting pretty regularly over the past 20-30 runs. Example: https://jenkins.freebsd.org/job/FreeBSD_HEAD-tests/1160/console

The current implementation is a maintenance burden. I'm close to just rewriting the code to use VMware Workstation or ESXi at this point instead of bhyve and hosting my own setup.

The Jenkins failure emails have been excessive lately -- I'm about ready to killfile all of them. Who's in charge of this setup? No one answered the question about NFS or what the configuration is. If the system is running without any admin, it seems prudent to at least turn off the emails until the issue can be diagnosed.

ci.freebsd.org has fixed this issue.
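
For anyone triaging a similar console log, here is a minimal Python sketch (not part of the Jenkins or ci.freebsd.org setup) that tallies the storage symptoms and test timings quoted earlier in this bug. It assumes result lines of the form `passed [3.183s]` or `broken: Test case timed out [645.362s]`, and kernel messages worded as in the 1108 console output; the names `summarize_console` and `slow_threshold` are hypothetical, and the sample text is a shortened excerpt of that output.

```python
import re

# Result lines as they appear in the console output quoted in this bug,
# e.g. "passed [3.183s]" or "broken: Test case timed out [645.362s]".
RESULT_RE = re.compile(
    r"\b(passed|failed|broken|skipped)\b[^\[\n]*\[(\d+(?:\.\d+)?)s\]"
)
# Kernel messages interleaved with the test output.
AHCI_TIMEOUT_RE = re.compile(r"ahcich\d+: Timeout on slot \d+ port \d+")
NFS_STALL_RE = re.compile(r"nfs server \S+ not responding")


def summarize_console(text, slow_threshold=60.0):
    """Count storage stalls and list slow or broken test results.

    slow_threshold is an arbitrary cut-off in seconds (an assumption,
    not a value taken from the Jenkins job configuration).
    """
    results = [(m.group(1), float(m.group(2))) for m in RESULT_RE.finditer(text)]
    return {
        "ahci_timeouts": len(AHCI_TIMEOUT_RE.findall(text)),
        "nfs_stalls": len(NFS_STALL_RE.findall(text)),
        "broken": sum(1 for status, _ in results if status == "broken"),
        "slow": [(s, secs) for s, secs in results if secs >= slow_threshold],
    }


# Shortened excerpt of the console output quoted in this bug.
SAMPLE = """\
passed [3.183s]
usr.sbin/etcupdate/always_test:main -> ahcich0: Timeout on slot 5 port 0
(ada0:ahcich0:0:0:0): CAM status: Command timeout
ahcich0: Timeout on slot 16 port 0
nfs server jenkins-10.freebsd.org:/builds: not responding
nfs server jenkins-10.freebsd.org:/builds: is alive again
broken: Test case timed out [645.362s]
usr.sbin/etcupdate/conflicts_test:main -> ahcich0: Timeout on slot 12 port 0
passed [203.671s]
"""

if __name__ == "__main__":
    for key, value in summarize_console(SAMPLE).items():
        print(f"{key}: {value}")
```

Comparing these counts between a run on the cluster and a run on a local-disk VM would show whether the slow and broken tests track the AHCI and NFS stalls; it does not by itself identify the cause.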