Bug 217871 - sys/netinet/fibs_test;slaac_on_nondefault_fib6 fails when run twice
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Alan Somers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-17 15:15 UTC by Alan Somers
Modified: 2017-03-26 19:54 UTC

See Also:
asomers: mfc-stable10?
asomers: mfc-stable11?


Attachments
Patch for the slaac_on_nondefault_fib6 testcase (1.04 KB, patch)
2017-03-17 23:26 UTC, Alan Somers

Description Alan Somers freebsd_committer 2017-03-17 15:15:17 UTC
SLAAC is supposed to both configure an interface and add its routes to the routing table.  Most of the time it succeeds.  However, the test case for BUG196361 revealed occasional failures.  If you configure an epair interface (both sides) immediately after creating it with "ifconfig epair create", sometimes the interface will get configured but no routes will be added.  Workarounds are:

1) Add a 1 second sleep between "ifconfig epair create" and statically configuring the a half of the epair.  It is not sufficient to add the sleep between statically configuring the a half and using SLAAC to configure the b half.

2) Add a longish (precise time unknown, but > 5 seconds) sleep between destroying an epair interface and creating a new one.  This bug has not been observed the first time that an epair is created.

The test case, currently disabled, is
sys/netinet/fibs_test:slaac_on_nondefault_fib6
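
A minimal sketch of workaround 1, assuming a POSIX shell.  The helper name create_epair_settled and the IFCONFIG and SETTLE_SECS variables are hypothetical and do not come from fibs_test.sh; the only behavior relied on is that "ifconfig epair create" prints the name of the a half.

```sh
#!/bin/sh
# Hypothetical helper, not part of fibs_test.sh.
# IFCONFIG and SETTLE_SECS are illustrative knobs so the command and the
# delay can be overridden.
IFCONFIG=${IFCONFIG:-ifconfig}
SETTLE_SECS=${SETTLE_SECS:-1}

create_epair_settled() {
    # "ifconfig epair create" prints the name of the a half, e.g. epair0a
    epaira=$($IFCONFIG epair create) || return 1
    # Workaround 1: pause before any address is configured on either half
    sleep "$SETTLE_SECS"
    echo "$epaira"
}
```

Calling it as EPAIRA=$(create_epair_settled) puts the delay before any address configuration, which is the placement reported to matter.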
Comment 1 Alan Somers freebsd_committer 2017-03-17 23:25:17 UTC
The problem seems to be that if you destroy an epair and then recreate it within about 60s, the SLAAC address from the previous (destroyed) interface gets assigned to the newly created interface.  I don't yet know why, but I can demonstrate it by running fibs_test:slaac_on_nondefault_fib6 twice in a row with the attached patch applied.  The patch randomizes the addresses for each iteration.  Note how the second run's failure message shows 2001:db8:3325:4cc5:ff:c0ff:fe00:60b assigned to epair0b.  This matches the prefix from the first run, not the prefix from the second run.

$ sudo kyua debug fibs_test:slaac_on_nondefault_fib6 && sudo kyua debug fibs_test:slaac_on_nondefault_fib6
fib is 2
fib is 3
net.inet6.ip6.forwarding: 1 -> 1
net.inet6.ip6.rfc6204w3: 1 -> 1
PREFIX is 2001:db8:3325:4cc5
setfib 2 ifconfig epair0a inet6 2001:db8:3325:4cc5::2/64 fib 2
setfib 3 ifconfig epair0b inet6 -ifdisabled accept_rtadv fib 3 up
Executing command [ ifconfig epair0b ]
Executing command [ netstat -rnf inet6 -F 3 ]
Executing command [ netstat -rnf inet6 -F 3 ]
Executing command [ netstat -rnf inet6 -F 3 ]
Executing command [ netstat -rnf inet6 -F 0 ]
Executing command [ netstat -rnf inet6 -F 0 ]
Executing command [ netstat -rnf inet6 -F 0 ]
Executing command [ netstat -rnf inet6 -F 1 ]
Executing command [ netstat -rnf inet6 -F 1 ]
Executing command [ netstat -rnf inet6 -F 1 ]
ifconfig epair0a destroy
net.inet6.ip6.rfc6204w3: 1 -> 1
net.inet6.ip6.forwarding: 1 -> 1
fibs_test:slaac_on_nondefault_fib6  ->  passed
fib is 2
fib is 3
net.inet6.ip6.forwarding: 1 -> 1
net.inet6.ip6.rfc6204w3: 1 -> 1
PREFIX is 2001:db8:78e6:5bce
setfib 2 ifconfig epair0a inet6 2001:db8:78e6:5bce::2/64 fib 2
setfib 3 ifconfig epair0b inet6 -ifdisabled accept_rtadv fib 3 up
Executing command [ ifconfig epair0b ]
ifconfig epair0a destroy
net.inet6.ip6.rfc6204w3: 1 -> 1
net.inet6.ip6.forwarding: 1 -> 1
Fail: regexp inet6 2001:db8:78e6:5bce:.*prefixlen 64.*autoconf not in stdout
epair0b: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8<VLAN_MTU>
        ether 02:ff:c0:00:06:0b
        inet6 fe80::ff:c0ff:fe00:60b%epair0b prefixlen 64 scopeid 0x6 
        inet6 2001:db8:3325:4cc5:ff:c0ff:fe00:60b prefixlen 64 tentative detached autoconf 
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        fib: 3
        groups: epair 
Files left in work directory after failure: forwarding.state, ifaces_to_cleanup, rfc6204w3.state, rtadvd.pid, rtadvd.sock
fibs_test:slaac_on_nondefault_fib6  ->  failed: atf-check failed; see the output of the test for details
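
The prefix mismatch in the output above can also be checked mechanically.  The sketch below is a hypothetical helper (has_autoconf_prefix is not part of the test suite) that reads ifconfig output on stdin and reports whether an autoconf address with the given prefix is present.  It reuses the pattern from the failing atf-check regexp.

```sh
#!/bin/sh
# Hypothetical helper, not part of fibs_test.sh.
# $1: expected /64 prefix, e.g. 2001:db8:78e6:5bce
# stdin: output of "ifconfig <interface>"
# Exit status 0 if an autoconf address with that prefix is listed.
has_autoconf_prefix() {
    grep -Eq "inet6 $1:.*prefixlen 64.*autoconf"
}

# Example use on a live system (interface name taken from the log above):
#   ifconfig epair0b | has_autoconf_prefix 2001:db8:78e6:5bce \
#       || echo "expected prefix not autoconfigured"
```

Fed the epair0b output from the second run, the check succeeds for the first run's prefix (2001:db8:3325:4cc5) and fails for the second run's prefix (2001:db8:78e6:5bce).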
Comment 2 Alan Somers freebsd_committer 2017-03-17 23:26:27 UTC
Created attachment 180917
Patch for the slaac_on_nondefault_fib6 testcase
Comment 3 Alan Somers freebsd_committer 2017-03-20 21:10:49 UTC
In https://reviews.freebsd.org/D9451, jhujhiti noticed that the IFDISABLED flag doesn't get added to epair0b immediately upon creation, but shortly after.  He suspected that was the cause of the problem.  However, it's not.  It's just a red herring.  The IFDISABLED flag gets added by devd, which runs pccard_ether, which runs "service netif quietstart" on the new interfaces.  Disabling devd does not fix the test.  On the second interface creation, the address from the previous run is still present.
Comment 4 Alan Somers freebsd_committer 2017-03-20 23:08:23 UTC
Turned out to be a bug in the test, not the kernel.  Fixed in r315656
Comment 5 commit-hook freebsd_committer 2017-03-20 23:08:28 UTC
A commit references this bug:

Author: asomers
Date: Mon Mar 20 23:07:35 UTC 2017
New revision: 315656
URL: https://svnweb.freebsd.org/changeset/base/315656

Log:
  Fix back-to-back runs of sys/netinet/fibs_test;slaac_on_nondefault_fib6

  This test was failing if run twice because rtadvd takes too long to die.
  The rtadvd process from the first run was still running when the
  second run created its interfaces.  The solution is to use SIGKILL during
  the cleanup instead of SIGTERM so rtadvd will die faster.

  While I'm here, randomize the addresses used for the test, which makes bugs
  like this easier to spot, and fix the cleanup order to be the opposite of
  the setup order

  PR:		217871
  MFC after:	18 days
  X-MFC-With:	315458
  Sponsored by:	Spectra Logic Corp

Changes:
  head/tests/sys/netinet/fibs_test.sh
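
The commit log describes switching the cleanup from SIGTERM to SIGKILL.  Below is a hedged sketch of that idea; it is not the code from r315656.  The helper name kill_and_wait is hypothetical, and the polling loop that waits for the process to disappear is an addition the commit log does not mention.

```sh
#!/bin/sh
# Hypothetical helper, not the code committed in r315656.
# $1: pidfile written by the daemon (e.g. rtadvd.pid)
# $2: maximum number of seconds to wait for the process to go away (default 5)
kill_and_wait() {
    pid=$(cat "$1") || return 1
    # SIGKILL cannot be caught, so the daemon gets no chance to linger
    kill -KILL "$pid" 2>/dev/null || true
    tries=${2:-5}
    # "kill -0" only probes whether the pid still exists
    while kill -0 "$pid" 2>/dev/null; do
        [ "$tries" -gt 0 ] || return 1
        sleep 1
        tries=$((tries - 1))
    done
    return 0
}
```

A cleanup routine could call kill_and_wait rtadvd.pid before destroying the interfaces, so the next run does not start while the old daemon is still alive.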
Comment 6 jhujhiti 2017-03-23 22:10:35 UTC
This is a good find, and it does fix rapid test runs, but unfortunately the original issue I had while developing persists. The rtsol in the test times out unless I insert a sleep like so:

        # Configure epair interfaces
        get_epair
        sleep 1
        setup_iface "$EPAIRA" "$FIB0" inet6 ${ADDR} ${MASK}
        echo setfib $FIB1 ifconfig "$EPAIRB" inet6 -ifdisabled accept_rtadv fib $FIB1 up
        setfib $FIB1 ifconfig "$EPAIRB" inet6 -ifdisabled accept_rtadv fib $FIB1 up

This sleep after epair creation is enough to fix it consistently. Moving the sleep down one line, below setup_iface, does not fix it. So it would seem that the issue is with the router interface rather than the client interface. I'm a bit puzzled as to the cause... It seems to be an issue prior to initializing inet6 on the interface at all.
Comment 7 Alan Somers freebsd_committer 2017-03-23 22:17:09 UTC
How do you reproduce the failure now?  I haven't seen any failures since 315656.
Comment 8 jhujhiti 2017-03-23 22:30:46 UTC
(In reply to Alan Somers from comment #7)

I'm simply running the latest CURRENT (well, master on the github mirror) unmodified. GENERIC kernel with the following /boot/loader.conf:

kern.geom.label.gptid.enable="0"
zfs_load="YES"
coretemp_load="YES"
if_epair_load="YES"
net.fibs=3
net.add_addr_allfibs=0
boot_multicons="YES"
boot_serial="YES"
comconsole_speed="57600"
console="comconsole,vidconsole"

The git HEAD it's running right now is c83649c43c7 which looks like r315762 in SVN.
Comment 9 Alan Somers freebsd_committer 2017-03-23 22:36:51 UTC
But what command are you running?  "kyua test" of the entire directory or just that single test, or something else?  How often do you need to run it before it fails?
Comment 10 jhujhiti 2017-03-23 22:59:23 UTC
(In reply to Alan Somers from comment #9)

Ah, of course :)

kyua test -k /usr/tests/sys/netinet/Kyuafile

It fails right after boot, and consistently after that. rtsol tries for 10 seconds and will send multiple router solicitations and not get a reply to any of them.

If I remove the sleep and insert ifconfig "$EPAIRA" like this...

# Configure epair interfaces
get_epair
ifconfig "$EPAIRA"
setup_iface "$EPAIRA" "$FIB0" inet6 ${ADDR} ${MASK}
ifconfig "$EPAIRA"

... the only unusual thing I see is a duplicate ether address that disappears after configuration:

Standard output:
fib is 1
fib is 2
net.inet6.ip6.forwarding: 0 -> 1
net.inet6.ip6.rfc6204w3: 0 -> 1
### first ifconfig
epair0a: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8<VLAN_MTU>
        ether 02:00:c0:00:04:0a
        ether 02:00:c0:00:04:0a
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        groups: epair 
setfib 1 ifconfig epair0a inet6 2001:db8:3e0d:a5a3::2/64 fib 1
### second ifconfig
epair0a: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8<VLAN_MTU>
        ether 02:00:c0:00:04:0a
        inet6 2001:db8:3e0d:a5a3::2 prefixlen 64 tentative 
        inet6 fe80::c0ff:fe00:40a%epair0a prefixlen 64 tentative scopeid 0x4 
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        fib: 1
        groups: epair 
setfib 2 ifconfig epair0b inet6 -ifdisabled accept_rtadv fib 2 up
Executing command [ ifconfig epair0b ]
ifconfig epair0a destroy
net.inet6.ip6.forwarding: 1 -> 0
net.inet6.ip6.rfc6204w3: 1 -> 0

Standard error:
Fail: regexp inet6 2001:db8:3e0d:a5a3:.*prefixlen 64.*autoconf not in stdout
epair0b: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8<VLAN_MTU>
        ether 02:ff:c0:00:05:0b
        inet6 fe80::ff:c0ff:fe00:50b%epair0b prefixlen 64 scopeid 0x5 
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T (10Gbase-T <full-duplex>)
        status: active
        fib: 2
        groups: epair 

My initial thought was that this is a DAD issue, but if that were the case, a sleep after adding the inet6 address should be what fixes it, rather than before.
Comment 11 Alan Somers freebsd_committer 2017-03-24 15:00:32 UTC
I just upgraded to r315874 and decreased net.fibs from 4 to 3 but I still can't reproduce your failure.
Comment 12 jhujhiti 2017-03-24 16:25:24 UTC
(In reply to Alan Somers from comment #11)

I'll rebuild my test machine from scratch over the weekend and see if it goes away.
Comment 13 jhujhiti 2017-03-26 19:54:33 UTC
Good news and bad news: The issue happens on a fresh install (11.0-RELEASE-p1 -> Subversion head), but I've somewhat isolated it. This simple script fails consistently - epair0b does not get an address:

ifconfig epair0 create
setfib 1 ifconfig epair0a inet6 2001:db8::1/64 fib 1
setfib 2 ifconfig epair0b inet6 -ifdisabled accept_rtadv fib 2 up
rtadvd -p rtadvd.pid -C rtadvd.sock -c /dev/null epair0a
rtsol epair0b
ifconfig epair0b
pkill -kill -F rtadvd.pid
rm -f rtadvd.pid rtadvd.sock
ifconfig epair0a destroy

If I remove "setfib 1", on the first run after boot, rtsol will complain that epair0b is disabled (and the subsequent ifconfig does show that the IFDISABLED flag is never removed from nd6 options, odd...), but epair0b successfully gets an address on every execution of the script after that.

While this is pretty clearly some kind of bug, could the test case avoid it by not calling setfib in setup_iface? I'm not clear on why setfib is used given that the ifconfig always uses the fib argument too.
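
When comparing runs with and without setfib, it may help to read back the FIB that the interface actually ended up in.  The sketch below is a hypothetical helper (iface_fib is not part of the test suite) that extracts the "fib:" line from ifconfig output on stdin.  It assumes that a missing "fib:" line means FIB 0, which matches the unconfigured epair0a output in comment 10 but is not otherwise confirmed in this thread.

```sh
#!/bin/sh
# Hypothetical helper, not part of fibs_test.sh.
# stdin: output of "ifconfig <interface>"
# Prints the interface FIB; prints 0 when no "fib:" line is present
# (assumption: ifconfig omits the line for the default FIB).
iface_fib() {
    awk '$1 == "fib:" { print $2; found = 1 } END { if (!found) print 0 }'
}

# Example use on a live system:
#   ifconfig epair0a inet6 2001:db8::1/64 fib 1
#   ifconfig epair0a | iface_fib
```

This only shows the interface's FIB.  The process FIB that setfib selects is a separate setting and is not reflected in ifconfig output.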