Bug 277875

Summary: pfctl cowardly refuses to load rules, broken between 8c94ed992702 & f29af8618bf9
Product: Base System Reporter: Dave Cottlehuber <dch>
Component: binAssignee: freebsd-net (Nobody) <net>
Status: Closed Unable to Reproduce    
Severity: Affects Only Me CC: emaste, freebsd, kp
Priority: ---    
Version: 15.0-CURRENT   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
abridged pf.conf
none
codel config
none
truss log none

Description Dave Cottlehuber freebsd_committer freebsd_triage 2024-03-21 23:00:33 UTC
Created attachment 249387 [details]
abridged pf.conf

after 8c94ed992702, servers behind firewall are unable to ping, dns, etc.

I think this is because `pfctl -vvef /etc/pf.conf` returns 1,
and whines about ALTQ even though we're not using any ALTQ function.

Removing dummynet config doesn't seem to address the issue.

Reverting to last boot env built off f29af8618bf9 and all is well.

- abridged pf.conf attached, full one available as needed.
- h/w is ten64 arm64 router running 15.0-CURRENT.
Comment 1 Dave Cottlehuber freebsd_committer freebsd_triage 2024-03-21 23:01:14 UTC
Created attachment 249388 [details]
codel config
Comment 2 Kristof Provost freebsd_committer freebsd_triage 2024-03-22 02:31:00 UTC
This is interesting. I can load the abridged pf.conf, but it seems to hallucinate ridentifier settings which are not in the file.

That may point towards some memory corruption during the parsing, but valgrind doesn't immediately seem to detect anything.

I see the above behaviour on both amd64 and arm64.
Comment 3 commit-hook freebsd_committer freebsd_triage 2024-03-22 08:37:33 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=88f557a2a9c31aaa954841b06d38eec84a1702fc

commit 88f557a2a9c31aaa954841b06d38eec84a1702fc
Author:     Kristof Provost <kp@FreeBSD.org>
AuthorDate: 2024-03-22 03:21:50 +0000
Commit:     Kristof Provost <kp@FreeBSD.org>
CommitDate: 2024-03-22 08:00:05 +0000

    libpfctl: fix incorrect labels copy

    We copied the entire parsed_labels struct, including the counter to a
    field that was only big enough for the labels (so not the counter).

    PR:             277875
    MFC after:      1 week

 lib/libpfctl/libpfctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 4 Dave Cottlehuber freebsd_committer freebsd_triage 2024-03-23 22:23:54 UTC
Created attachment 249438 [details]
truss log

Thanks, rebuilt with that patch included.

I reduced the failing ruleset to this minimal example:

```
# pfctl -s Running
Enabled
# pfctl -F all
Ethernet rules cleared
rules cleared
nat cleared
0 tables deleted.
0 states cleared
source tracking entries cleared
pf: statistics cleared
pf: interface flags reset
root@# echo 'pass in quick on ng0 proto tcp to port 2200' | pfctl -vgf -
No ALTQ support in kernel
ALTQ related functions disabled
pass in quick on ng0 proto tcp from any to any port = 2200 flags S/SA keep state
# echo $status
1
# pfctl -s rules
#
```


Evidently its not a ruleset parsing issue.

I swapped ng0 for lo0 and the same situation occurs.

running under truss, final lines from attached full log:

ioctl(3,DIOCSETTIMEOUT,0x621da911a368)		 = 0 (0x0)
ioctl(3,DIOCSETTIMEOUT,0x621da911a368)		 = 0 (0x0)
ioctl(3,DIOCSETDEBUG,0x621da911a368)		 = 0 (0x0)
sendto(5," \0\0\0\^P\0\^E\0\^A\0\0\0\0\0\0"...,32,0,NULL,0) = 32 (0x20)
recvmsg(5,{0x621da911a26c,12,[{"\M-x\0\0\0\^P\0\^E\0\^A\0\0\0\0"...,65536}],1,{},0,0},0) = 284 (0x11c)
sendto(5,"\^\\0\0\0\^Q\0\^E\0\^B\0\0\0\0\0"...,28,0,NULL,0) = 28 (0x1c)
recvmsg(5,{0x621da911a26c,12,[{"0\0\0\0\^B\0\0\0\^B\0\0\0\0\0\0"...,65536}],1,{},0,0},0) = 48 (0x30)
ioctl(3,DIOCSETHOSTID,0x621da911a368)		 = 0 (0x0)
ioctl(3,DIOCSETREASS,0x621da911a368)		 = 0 (0x0)
ioctl(3,DIOCKEEPCOUNTERS,0x621da911a310)	 = 0 (0x0)
ioctl(3,DIOCGETLIMIT,0x621da911a300)		 = 0 (0x0)
ioctl(3,DIOCSETSYNCOOKIES,0x621da911a300)	 = 0 (0x0)
ioctl(3,DIOCXROLLBACK,0x621da911a398)		 = 0 (0x0)
extl_if = "ng0"
pass in quick on ng0 proto tcp from any to any port = 2200 flags S/SA keep state
write(1,"extl_if = "ng0"\npass in quick o"...,97) = 97 (0x61)
exit(0x1)					
process exit, rval = 1


trying the same ruleset on a different arm64 box with same from-source
build, it works as expected - rules loaded, and output displayed.

I'll do a full re-install into an empty BE next.
Comment 5 Kristof Provost freebsd_committer freebsd_triage 2024-04-01 08:24:55 UTC
That truss output is strange.

We only DIOCXROLLBACK from pfctl_rules() (in sbin/pfctl), and then only after a 'goto _error'.

That must mean we've failed to load one of the options in pfctl_load_options(). All but one of those are old-style ioctls and show no errors, so that would imply that it has to be pfctl_load_logif() (i.e. pfctl_set_statusif() in libpfctl) that fails.
However, that can really only fail if the log interface name is too long, and that does not appear to be the case here.

That's a newly converted-to-netlink call, so at least that's somewhat plausible at a source of shiny new bugs.

It's also all we have to go on right now. Can you try running `dtrace -n 'fbt::pf_handle_set_statusif:return { printf("%#x %#x", arg0, arg1); }'` and then loading the relevant pf.conf?
Comment 6 Dave Cottlehuber freebsd_committer freebsd_triage 2024-04-01 11:44:09 UTC
running `pfctl -vf /etc/pf.comf` repeatedly

dtrace: description 'fbt::pf_handle_set_statusif:return ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  4  61685    pf_handle_set_statusif:return 0xe8 0
  4  61685    pf_handle_set_statusif:return 0xe8 0
  1  61685    pf_handle_set_statusif:return 0xe8 0
  4  61685    pf_handle_set_statusif:return 0xe8 0
  1  61685    pf_handle_set_statusif:return 0xe8 0
  3  61685    pf_handle_set_statusif:return 0xe8 0
Comment 7 Dave Cottlehuber freebsd_committer freebsd_triage 2024-04-01 11:50:59 UTC
too soon, this last one is after rebasing to 319a5d086b50 (upstream/main) if_bridge: use IF_MINMTU.
Comment 8 Dave Cottlehuber freebsd_committer freebsd_triage 2024-04-02 07:30:35 UTC
This looks like a mismatched userland vs kernel, after dtrace reporting
invalid probe specifier.

After a full rebuild with empty /usr/obj and a fresh beinstall, this
can't be reproduced.

Not present in latest CURRENT either. 

thanks kp for your help!