Created attachment 190888: patch for usr.bin/awk/awk.1 to clarify the use of FS
I realise that awk is a contributed package, but its man page was pulled in from OpenBSD and has since been locally modified: https://svnweb.freebsd.org/base/head/usr.bin/awk/awk.1?view=log
Anyway, the man page assumes that, apart from the exceptions it mentions, FS is a regular expression. This is not the case for a single character: a single-character FS is treated as a literal. The attached patch clarifies this. Note that the man page for gawk is correct on the subject.
Cheers
A RE can be a literal of one or more characters. From re_format(7):
A branch is one‡ or more pieces, concatenated. A piece is an atom possibly followed by a single‡ ‘*’, ‘+’, ‘?’, or bound. An atom is ... , or a single character ...
Consider:
% echo foobar | awk -F o -v OFS=X '{$1=$1;print}'
fXXbar
% echo foobar | awk -F oo -v OFS=X '{$1=$1;print}'
fXbar
%
I don't think there is a need to separately reference single-character literal REs. (And unless I'm missing something I think the gawk man page is misguided for doing so.)
Hi. Thanks for the reply. I'll explain how I got here: I wanted a quick hack to split a line at every character, and at that point I was not aware that awk allows a null FS to do the job. Hence, believing the string was a regular expression, I set FS to "." which, contrary to the manual, was taken as a literal, not a RE!
Indeed, in your description of the atom from re_format, you missed out: "or a single character with no other significance (matching that character)."
Your examples used a single character that is already a literal in a RE, which isn't always the case:
% printf 'hello(world' | egrep '('
egrep: Unmatched ( or \(
% printf 'hello(world' | awk -F '(' '{print $1}'
hello
I know this is hardly a major error, but it is still inaccurate, especially in the case of "."!
Just for info, the actual text from gawk (which probably phrases it better than I did!) is: "If FS is a single character, fields are separated by that character. If FS is the null string, then each individual character becomes a separate field. Otherwise, FS is expected to be a full regular expression."
Cheers, Jamie
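For anyone who wants to reproduce the "." case described above, here is a minimal sketch. The sample input `a.b.c` and the shell variable names are made up for illustration; the counts are what I would expect from gawk, mawk and the one-true-awk, but treat them as assumptions to check on your own system.

```sh
# A lone "." as FS is taken literally, so "a.b.c" has three fields.
literal=$(printf 'a.b.c\n' | awk -F . '{ print NF }')

# Wrapping the dot in parentheses makes FS longer than one character,
# so it is handled as a regular expression and "." matches any character:
# all five characters become separators, leaving six empty fields.
any_char=$(printf 'a.b.c\n' | awk -F '(.)' '{ print NF }')

# A lone "(" is not a valid ERE on its own, yet works as a literal separator.
paren=$(printf 'hello(world\n' | awk -F '(' '{ print $1 }')

echo "$literal $any_char $paren"
```

The difference between the first two results is the behaviour the man page did not describe.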
Ok, clearly I *was* missing something. This probably warrants some further contemplation since the same exceptions apply to the "split()" function. Perhaps the pertinent references to "regular expression" should be removed and a separate paragraph or section added to discuss "field separators", rather than repeating the exceptions in multiple places. Agreed it renders the awk(1) man page inaccurate and it is one of the hallmarks of FreeBSD that its man pages provide accurate reference information. In fact it has stirred a vague memory in me of having been caught out by this exception. Well spotted!
Returning to pool.
This change isn't technically correct. Space is special, and that's not documented. What happens when FS is a single character is not well documented either. Here is what I have in my tree to fix this paragraph:
An input line is normally made up of fields separated by whitespace, or by the extended regular expression FS as described below. The fields are denoted $1, $2, ..., while $0 refers to the entire line.
If FS is null, the input line is split into one field per character. However, this behavior is unspecified in the IEEE Std 1003.1 ("POSIX.1") standard.
If FS is a single space, then leading and trailing blank and newline characters are skipped. Fields are delimited by one or more blank or newline characters. A blank character is a space or a tab.
If FS is a single character, other than space, fields are delimited by each single occurrence of that character.
The FS variable defaults to a single space.
I'll commit it after waiting a week for feedback.
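The three special cases in the proposed text can be checked with a short sketch like the one below. The sample inputs and variable names are illustrative only, and the empty-FS case is unspecified by POSIX, so its result is an assumption about common implementations rather than a guarantee.

```sh
# FS is a single space: leading/trailing blanks are skipped and runs of
# spaces and tabs collapse, so only "a" and "b" remain.
space=$(printf '  a \t b  \n' | awk 'BEGIN { FS = " " } { print NF }')

# FS is a single non-space character: every occurrence delimits a field,
# so "a::b" has an empty field in the middle.
colon=$(printf 'a::b\n' | awk 'BEGIN { FS = ":" } { print NF }')

# FS is null: one field per character (unspecified by POSIX).
chars=$(printf 'abc\n' | awk 'BEGIN { FS = "" } { print NF }')

echo "$space $colon $chars"
```

FS is set in a BEGIN block rather than with -F because some implementations warn about or ignore an empty -F argument.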
A commit in branch main references this bug:
URL: https://cgit.FreeBSD.org/src/commit/?id=b891aedcdd2d9a3e1530e45f6b785b768cccc466
commit b891aedcdd2d9a3e1530e45f6b785b768cccc466
Author: Warner Losh <imp@FreeBSD.org>
AuthorDate: 2021-07-20 02:10:22 +0000
Commit: Warner Losh <imp@FreeBSD.org>
CommitDate: 2021-07-20 04:33:26 +0000
    awk: Add more details top the FS variable
    The current description of the FS is true, but only part of the truth. Add information about single characters and note that FS="" is undefined by the standard, though the two other awk implenetations (mawk and gawk) also have this interpretation.
    PR: 226112
    Sponsored by: Netflix
 usr.bin/awk/awk.1 | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)
(In reply to Warner Losh from comment #5) Thanks Warner. That looks perfect to me. I had actually originally thought that setting FS to a single space made it literal (i.e. different from the default FS of leading/trailing-stripped-collapsed-whitespace) but now I realise you need to use '[ ]' to achieve that condition. Cheers, Jamie
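To illustrate the '[ ]' point, a small sketch comparing the default separator with a bracketed literal space. The sample line and variable names are hypothetical.

```sh
line=' a  b'

# Default FS (a single space): the leading blank is skipped and the
# double space collapses, giving two fields.
collapsed=$(printf '%s\n' "$line" | awk '{ print NF }')

# FS='[ ]' is a regular expression matching exactly one space, so every
# space delimits: "", "a", "", "b".
literal=$(printf '%s\n' "$line" | awk -F '[ ]' '{ print NF }')

echo "$collapsed $literal"
```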
(In reply to Warner Losh from comment #5) One thing: As Wayne pointed out in an earlier comment, these same rules and exceptions to FS also apply to the "split" function, although the "fs" parameter there is just described as a regular expression.
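A quick sketch suggesting that split() follows the same separator rules as FS. The strings and variable names are made up, and the counts are what I would expect from the common awk implementations rather than something the man page currently promises.

```sh
counts=$(awk 'BEGIN {
    n1 = split("a.b.c", p, ".")      # single character: literal dot, not "any character"
    n2 = split("  a   b  ", p, " ")  # single space: whitespace rule, edges skipped
    n3 = split("a::b", p, ":")       # each occurrence delimits, empty middle field
    print n1, n2, n3
}')

echo "$counts"
```

If the third argument were always treated as a regular expression, the first call would not return 3.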
^Triage: committed back in 2021.