Bug 127051 - [hme] hme interfaces "pause" with the message "device timeout" on FreeBSD 7.0/sparc64 on an Enterprise 220R
Summary: [hme] hme interfaces "pause" with the message "device timeout" on FreeBSD 7.0...
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: sparc64 (show other bugs)
Version: Unspecified
Hardware: Any Any
: Normal Affects Only Me
Assignee: freebsd-sparc64 (Nobody)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-02 22:00 UTC by Paulo Afonso Graner Fessel
Modified: 2010-09-19 16:14 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Paulo Afonso Graner Fessel 2008-09-02 22:00:05 UTC
We have a pair of UltraSparc II servers configured as HA-firewall with
carp and pfsync. I noticed that even with an advskew of zero on the
primary firewall (vtsfprfw01) the carp interfaces end up migrating to
the backup firewall, which has an advskew of 200. Here's ifconfig for
the primary firewall (master):

hme0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:d0:c3:dd
        inet 192.168.0.1 netmask 0xffffff00 broadcast 192.168.0.255
        media: Ethernet 100baseTX <full-duplex>
        status: active
hme1: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a6:b4
        inet 200.215.183.101 netmask 0xfffffff0 broadcast 200.215.183.111
        media: Ethernet 100baseTX <full-duplex>
        status: active
hme2: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a6:b5
        inet 200.143.2.2 netmask 0xffffff00 broadcast 200.143.2.255
        media: Ethernet 100baseTX <full-duplex>
        status: active
hme3: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a6:b6
        media: Ethernet autoselect
hme4: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a6:b7
        media: Ethernet autoselect
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        inet 127.0.0.1 netmask 0xff000000 
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 1460
        pfsync: syncdev: hme0 syncpeer: 192.168.0.2 maxupd: 128
carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 200.215.183.100 netmask 0xfffffff0 
        carp: BACKUP vhid 1 advbase 1 advskew 0
carp1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 200.143.2.1 netmask 0xffffff00 
        carp: BACKUP vhid 2 advbase 1 advskew 0

And the same, for the second firewall (backup):

vtsfprfw02# ifconfig -a
hme0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:e7:39:31
        inet 192.168.0.2 netmask 0xffffff00 broadcast 192.168.0.255
        media: Ethernet 100baseTX <full-duplex>
        status: active
hme1: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a3:a0
        inet 200.215.183.102 netmask 0xfffffff0 broadcast 200.215.183.111
        media: Ethernet 100baseTX <full-duplex>
        status: active
hme2: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a3:a1
        inet 200.143.2.3 netmask 0xffffff00 broadcast 200.143.2.255
        media: Ethernet 100baseTX <full-duplex>
        status: active
hme3: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a3:a2
        media: Ethernet autoselect
hme4: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b<RXCSUM,TXCSUM,VLAN_MTU>
        ether 08:00:20:bc:a3:a3
        media: Ethernet autoselect
pflog0: flags=141<UP,RUNNING,PROMISC> metric 0 mtu 33160
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        inet 127.0.0.1 netmask 0xff000000 
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 1460
        pfsync: syncdev: hme0 syncpeer: 192.168.0.1 maxupd: 128
carp0: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 200.215.183.100 netmask 0xfffffff0 
        carp: MASTER vhid 1 advbase 1 advskew 200
carp1: flags=49<UP,LOOPBACK,RUNNING> metric 0 mtu 1500
        inet 200.143.2.1 netmask 0xffffff00 
        carp: MASTER vhid 2 advbase 1 advskew 200

After noticing this, I also saw that "local-mac-address?" on the first
firewall was set to "false", what caused all the interface ports to show
the same MAC address. I've fixed this and rebooted the server, to
investigate if this had something to do with the issue. Everything was
alright during approximate 30 minutes, when the firewall has changed to
the secondary machine. Here's an excerpt from /var/log/messages from the
primary firewall:

Sep  2 10:18:51 vtsfprfw01 ftp-proxy[929]: listening on 127.0.0.1 port 8021
Sep  2 10:18:51 vtsfprfw01 getty[943]: open /dev/ttyv2: No such file or directory
Sep  2 10:18:51 vtsfprfw01 getty[945]: open /dev/ttyv4: No such file or directory
Sep  2 10:18:51 vtsfprfw01 getty[941]: open /dev/ttyv0: No such file or directory
Sep  2 10:18:51 vtsfprfw01 getty[948]: open /dev/ttyv7: No such file or directory
Sep  2 10:18:51 vtsfprfw01 getty[946]: open /dev/ttyv5: No such file or directory
Sep  2 10:18:51 vtsfprfw01 getty[944]: open /dev/ttyv3: No such file or directory
Sep  2 10:18:51 vtsfprfw01 getty[942]: open /dev/ttyv1: No such file or directory
Sep  2 10:18:51 vtsfprfw01 getty[947]: open /dev/ttyv6: No such file or directory
Sep  2 10:18:58 vtsfprfw01 login: ROOT LOGIN (root) ON ttyu0
Sep  2 10:44:43 vtsfprfw01 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 10:44:43 vtsfprfw01 kernel: hme2: device timeout
Sep  2 10:44:48 vtsfprfw01 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
Sep  2 10:44:48 vtsfprfw01 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  2 11:11:48 vtsfprfw01 kernel: hme2: device timeout
Sep  2 11:35:19 vtsfprfw01 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 11:35:19 vtsfprfw01 kernel: hme2: device timeout
Sep  2 11:35:24 vtsfprfw01 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
Sep  2 11:35:24 vtsfprfw01 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  2 13:37:16 vtsfprfw01 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 13:37:16 vtsfprfw01 kernel: hme1: device timeout
Sep  2 13:37:17 vtsfprfw01 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 16:11:06 vtsfprfw01 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK)
Sep  2 16:11:14 vtsfprfw01 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)

As it can be seen from the logs, there's a number of messages like
"hmeX: device timeout". When this happens, carp0 and carp1 see this as
a disconnection and are forced to the secondary firewall. The most
interesting is part is that I don't lose communication with either
machines: I'm able to ping the first firewall normally after the event,
and I don't get messages neither in the OS or on the switch pointing to
link loss.

The secondary server also shows the same problem (hmeX device timeout):

Sep  1 15:58:00 vtsfprfw02 kernel: hme0: device timeout
Sep  1 16:35:00 vtsfprfw02 kernel: hme0: device timeout
Sep  1 16:35:40 vtsfprfw02 last message repeated 2 times
Sep  1 16:45:33 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Sep  1 16:45:34 vtsfprfw02 kernel: hme1: device timeout
Sep  1 16:45:35 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  1 16:45:47 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  1 16:45:47 vtsfprfw02 kernel: carp1: BACKUP -> MASTER (preempting a slower master)
Sep  1 16:45:47 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK)
Sep  1 16:45:48 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  1 16:45:48 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
Sep  1 16:45:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  1 17:12:16 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Sep  1 17:12:16 vtsfprfw02 kernel: hme1: device timeout
Sep  1 17:12:17 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  1 17:44:33 vtsfprfw02 kernel: arp: 200.215.183.110 is on hme1 but got reply from 00:14:d1:38:92:ba on hme2
Sep  1 17:45:05 vtsfprfw02 last message repeated 21 times
Sep  1 17:45:27 vtsfprfw02 last message repeated 15 times
Sep  1 18:55:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  1 18:55:48 vtsfprfw02 kernel: carp1: BACKUP -> MASTER (preempting a slower master)
Sep  1 18:55:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK)
Sep  1 18:55:48 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  1 18:55:49 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
Sep  1 18:55:49 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  1 19:24:43 vtsfprfw02 sshd[34638]: error: PAM: authentication error for pfessel from 201.20.234.104
Sep  1 19:24:46 vtsfprfw02 sshd[34641]: error: ssh_msg_send: write
Sep  2 10:19:31 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 10:19:50 vtsfprfw02 kernel: arp: 200.143.2.2 moved from 08:00:20:d0:c3:dd to 08:00:20:bc:a6:b5 on hme2
Sep  2 10:19:51 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 10:45:15 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK)
Sep  2 10:45:15 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
Sep  2 10:45:15 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  2 10:45:20 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 10:45:26 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 11:12:19 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK)
Sep  2 11:12:32 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 11:35:51 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK)
Sep  2 11:35:51 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
Sep  2 11:35:51 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  2 11:35:56 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 11:36:06 vtsfprfw02 kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 13:37:47 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  2 13:37:48 vtsfprfw02 kernel: carp1: BACKUP -> MASTER (preempting a slower master)
Sep  2 13:37:48 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.143.2.1 (!AF_LINK)
Sep  2 13:37:48 vtsfprfw02 kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Sep  2 13:37:49 vtsfprfw02 kernel: carp0: BACKUP -> MASTER (preempting a slower master)
Sep  2 13:37:49 vtsfprfw02 kernel: arp_rtrequest: bad gateway 200.215.183.100 (!AF_LINK)
Sep  2 14:01:06 vtsfprfw02 sshd[37221]: error: PAM: authentication error for root from 200.143.2.125
Sep  2 14:01:08 vtsfprfw02 sshd[37221]: error: PAM: authentication error for root from 200.143.2.125
Sep  2 16:07:37 vtsfprfw02 sshd[37559]: error: PAM: authentication error for root from 200.143.2.125
Sep  2 16:08:17 vtsfprfw02 last message repeated 3 times
Sep  2 16:08:35 vtsfprfw02 sshd[37559]: error: PAM: authentication error for root from 200.143.2.125
Sep  2 16:08:38 vtsfprfw02 sshd[37566]: error: ssh_msg_send: write
Sep  2 16:08:50 vtsfprfw02 sshd[37567]: error: PAM: authentication error for root from 200.143.2.125
Sep  2 16:09:00 vtsfprfw02 last message repeated 5 times
Sep  2 16:11:40 vtsfprfw02 kernel: hme2: device timeout

This log is even more interesting, as it's shown here there's the first
transition of the carp interfaces from MASTER to BACKUP from when I first
rebooted the master firewall to fix the eeprom issue; but there are also
other transitions from BACKUP to MASTER to BACKUP to MASTER again, which
haven't been logged in the master firewall!

Now, some background information about network topology.

* carp0 is connected to hme1 in both servers (master and backup), and
  points to our default internet gateway;
* carp1 is connected to hme2 in both servers (master and backup), and
  points to our data center's front-end network (200.143.2.0/24)
* pfsync0 is associated to hme0 in both servers. We don't use a back-to-
  back connection via crossover cable, but instead we use a dedicated
  VLAN on one of our switches. Because of this I've chosen to specify
  syncpeers on both servers, with the master firewall pointing to the
  backup firewall and vice-versa, always using hme0 as the physical
  device for pfsync0. I've checked that VHIDs from other firewalls which
  use this VLAN don't coincide with the ones we use in this particular
  configuration.

Finally, hardware configuration follows:

* Master firewall: E220R with 2 USII 450 MHz processors, 2 SUN18G SCSI
  hard disks, 2 GB RAM, one Sun QFE PCI interface (hme1-hme4 ports) plus
  the onboard HME interface (hme0).

* Backup firewall: E420R with 2 USII 450 MHz processors, 2 SUN18G SCSI
  hard disks, 2 GB RAM, one Sun QFE PCI interface (hme1-hme4 ports) plus
  the onboard HME interface (hme0).

As you can see, both machines are nearly identical.

My feeling is that this behaviour has something to do with something
specific to sparc64 architecture, as I do have other 2 Sun QFE PCI boards
running on Intel architecture (Dell Poweredges 1650 and 1750 respectively)
and there are no such issues whatsoever. These i386 servers have already
an uptime of 11 days and there is no "hmeX: device timeout" in either server.

How-To-Repeat: Just configure the networking in the servers as shown and then is just a
question of time for the problem to appear. It can take days, hours, or
minutes, but it will show up after some time.
Comment 1 marius 2008-09-04 22:25:15 UTC
On Tue, Sep 02, 2008 at 08:58:56PM +0000, Paulo Afonso Graner Fessel wrote:
> 
> >Environment:
> FreeBSD vtsfprfw01.virtus-ti.com.br 7.0-RELEASE FreeBSD 7.0-RELEASE #7: Fri Jun 20 19:29:52 BRT 2008     root@vtsuprfw01.dedalusprime.com.br:/usr/obj/usr/src/sys/VIRTUSFW  sparc64

Please give 7.0-STABLE/7.1-PRERELEASE a try, there where several
bugs in hme(4) fixed since 7.0-RELEASE.

Marius
Comment 2 Marius Strobl freebsd_committer freebsd_triage 2008-10-06 20:03:49 UTC
State Changed
From-To: open->feedback

Feedback provided, waiting for submitter to test.
Comment 3 Jaakko Heinonen freebsd_committer freebsd_triage 2010-09-19 16:14:23 UTC
State Changed
From-To: feedback->closed

Feedback timeout.