Bug 220004

Summary: nfsv4: vflush failure/nfsv4 client/server error (IFLIB)
Product: Base System
Reporter: O. Hartmann <ohartmann>
Component: kern
Assignee: freebsd-bugs (Nobody) <bugs>
Status: New ---
Severity: Affects Many People
CC: cem, pi, sbruno, shurd
Priority: ---
Version: CURRENT
Hardware: Any
OS: Any
Bug Depends on: 219428
Bug Blocks:

Description O. Hartmann 2017-06-15 08:22:04 UTC

Since IFLIB was introduced a while ago, our Intel-NIC-based server/client systems have suffered from corrupted network connectivity. The problem first appeared on em() and then on igb() devices; it seems to be most prominent on em() devices, but igb() is affected as well.

Both server and client are FreeBSD CURRENT (server: FreeBSD 12.0-CURRENT #15 r319965: Thu Jun 15 05:56:12 CEST 2017 amd64, client is FreeBSD 12.0-CURRENT #20 r319934: Wed Jun 14 06:18:46 CEST 2017 amd64 at this moment).

Symptom: While syncing large amounts of data via rsync in the context of an NFSv4 client/server infrastructure (both sides, server and client, are 12-CURRENT of the most recent revision), the connection from the client to the server gets corrupted and dies.

A while ago, shortly after the introduction of IFLIB and the first occurrence of this bug, it was possible to revive the connection from the client side (in the most prominent case an Intel i217-LM NIC (class=0x020000 card=0x11ed1734 chip=0x153a8086 rev=0x05 hdr=0x0) in a Fujitsu Celsius M740 workstation) by taking the device down and bringing it up again via "ifconfig em0 down/up". As development progressed, we reached a state where this procedure ends in a total loss of the NIC, and it was not possible to revive the NIC in any way other than rebooting the system. On a couple of servers equipped with Intel i350 dual-port NICs I was also able to reproduce the failure by dd'ing a large amount of data from an i350-equipped client to an i350-equipped counterpart server (NFSv4 on both sides, FreeBSD 12-CURRENT on both sides). That was a couple of weeks ago.

Now the situation has become even worse.

While syncing large amounts of data between a FreeBSD 12-CURRENT client and a server (most recent OS versions as described above; the client mounts the NFSv4 share via autofs, which is built into the kernel), the connection gets terminated on the client side. The data is a large poudriere repository that is built on the creator machine and then synced via rsync to the repository-delivering host.

This time, the NIC on the client seems to be alive, because I can ping other hosts. Hitting "Ctrl-T" on the terminal from which I initiated the syncing process, I get either 

load: 0.34  cmd: rsync 13468 [nfsaio] 101.43r 0.00u 29.36s 0% 2089k

or, after I tried to restart automount on the client side as well as mountd on the server side,

load: 0.22  cmd: rsync 23467 [nfsreq] 2364.43r 0.00u 39.11s 0% 3076k.
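
For anyone comparing several of these Ctrl-T (SIGINFO) status lines, the following is a minimal Python sketch that splits one line into its fields. It assumes the usual FreeBSD layout (load average, command, pid, wait channel in brackets, real/user/system time, CPU percentage, resident size in kB); the function name parse_status and the dictionary keys are illustrative and not part of any FreeBSD tool.

```python
import re

# Assumed layout of a FreeBSD Ctrl-T status line:
#   load: <load>  cmd: <command> <pid> [<wchan>] <real>r <user>u <sys>s <cpu>% <rss>k
STATUS_RE = re.compile(
    r"load:\s+(?P<load>[\d.]+)\s+cmd:\s+(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<wchan>[^\]]+)\]\s+(?P<real>[\d.]+)r\s+(?P<user>[\d.]+)u\s+"
    r"(?P<sys>[\d.]+)s\s+(?P<cpu>\d+)%\s+(?P<rss>\d+)k"
)


def parse_status(line):
    """Return the fields of a status line as a dict, or None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "load": float(fields["load"]),
        "cmd": fields["cmd"],
        "pid": int(fields["pid"]),
        "wchan": fields["wchan"],  # what the process is currently sleeping on
        "real_s": float(fields["real"]),
        "user_s": float(fields["user"]),
        "sys_s": float(fields["sys"]),
        "cpu_pct": int(fields["cpu"]),
        "rss_kb": int(fields["rss"]),
    }


if __name__ == "__main__":
    for line in (
        "load: 0.34  cmd: rsync 13468 [nfsaio] 101.43r 0.00u 29.36s 0% 2089k",
        "load: 0.22  cmd: rsync 23467 [nfsreq] 2364.43r 0.00u 39.11s 0% 3076k",
    ):
        print(parse_status(line))
```

In both lines above the user time stays at 0.00 while rsync sleeps in an NFS wait channel (nfsaio, then nfsreq), which fits a process blocked on the NFS mount rather than one doing work.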

The console on the client shows

WARNING: autofs_unmount: vflush failed with error 16
nfsv4 client/server protocol prob err=10020
nfsv4 client/server protocol prob err=10020
nfsv4 client/server protocol prob err=10020
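
As a reading aid for the two numeric codes above, here is a small Python sketch that maps them to symbolic names. The table is hand-written and only covers these two values: 16 is EBUSY in FreeBSD's errno numbering, and 10020 is NFS4ERR_NOFILEHANDLE in the NFSv4 status codes of RFC 7530. The helper name describe is illustrative only.

```python
# Hand-written lookup table; only the two codes seen on the console are listed.
# 16 comes from FreeBSD's <sys/errno.h>, 10020 from the NFSv4 status codes (RFC 7530).
KNOWN_CODES = {
    16: ("EBUSY", "device busy"),
    10020: ("NFS4ERR_NOFILEHANDLE", "no current filehandle set for the operation"),
}


def describe(code):
    """Return a one-line description for a numeric error code."""
    name, text = KNOWN_CODES.get(code, ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


if __name__ == "__main__":
    for code in (16, 10020):
        print(describe(code))
```

Read this way, the vflush failure means the autofs-triggered mount still had busy vnodes when the unmount was attempted, and the repeated err=10020 is an NFSv4 protocol-level status rather than a local errno.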

So the most intuitive procedure to reset the connection fails and I'm stuck.
Please see also Bug 219428.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2017-06-15 18:37:31 UTC
Prod sbruno in case this is actually related to changes in the network driver(s).