Bug 226112

Summary: awk(1) man page unclear about field separator, FS
Product: Documentation
Reporter: Jamie Landeg-Jones <jamie>
Component: Manual Pages
Assignee: Warner Losh <imp>
Status: In Progress ---
Severity: Affects Some People
CC: 0mp, doc, imp, jamie, ws, ygy
Priority: ---
Keywords: patch
Version: Latest
Hardware: Any
OS: Any
Bug Depends on:
Bug Blocks: 230730
Attachments:
Description: patch for usr.bin/awk/awk.1 to clarify the use of FS
Flags: none

Description Jamie Landeg-Jones 2018-02-22 10:48:54 UTC
Created attachment 190888
patch for usr.bin/awk/awk.1 to clarify the use of FS

I realise that awk is a contributed package, but the man page for it has been pulled in from OpenBSD, and has since been locally modified: https://svnweb.freebsd.org/base/head/usr.bin/awk/awk.1?view=log

Anyway, the man page assumes that, apart from the exceptions mentioned, FS is a regular expression. This is not the case for a single character: a single-character FS is treated as a literal.
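
A minimal sketch of the difference, assuming an awk that follows the behaviour described in this report (gawk, mawk and the one-true-awk all do, as far as I know). The helper name count_fields is made up for illustration:

```shell
# Count the fields awk sees in the sample line "a.b.c" for a given FS.
# count_fields is a hypothetical helper, not part of awk.
count_fields() {
    printf 'a.b.c\n' | awk -v FS="$1" '{ print NF }'
}

count_fields '.'     # single character: taken literally, prints 3
count_fields '[.]'   # bracket expression for a literal dot, prints 3
count_fields '(.)'   # multi-character ERE matching any character, prints 6
```

If a lone "." were treated as a regular expression, the first call would behave like the third one.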

Attached patch clarifies this. Note that the man page for gawk is correct on the subject.

Cheers
Comment 1 Wayne Sierke 2018-02-24 05:08:03 UTC
An RE can be a literal of one or more characters. From re_format(7):

  A branch is one‡ or more pieces, concatenated.
  A piece is an atom possibly followed by a single‡ ‘*’, ‘+’, ‘?’, or
     bound.
  An atom is ... , or a single character ...


Consider:

  % echo foobar | awk -F o -v OFS=X '{$1=$1;print}'
  fXXbar
  % echo foobar | awk -F oo -v OFS=X '{$1=$1;print}'
  fXbar
  %


I don't think there is a need to separately reference single-character literal REs.

(And unless I'm missing something I think the gawk man page is misguided for doing so.)
Comment 2 Jamie Landeg-Jones 2018-02-24 20:41:03 UTC
Hi. Thanks for the reply.

I'll explain how I got here:

I wanted to do a quick hack to split a line at every character, and at that point, I was not familiar with "awk" allowing a null (empty) FS to do the job.

Hence, believing the string was a regular expression, I set FS to "." which - contrary to the manual - was taken as a literal, not an RE!

Indeed, in your description of the atom from re_format, you missed out:

"or a single character with no other significance (matching that character)."

As for your examples: you used single characters that are already literal in an RE, which isn't always the case:

% printf 'hello(world' | egrep '('
egrep: Unmatched ( or \(
% printf 'hello(world' | awk -F '(' '{print $1}'
hello

I know this is hardly a major error, but it is still inaccurate - especially in the case of "."!

Just for info, the actual text from gawk (which probably phrases it better than I did!) is:

"If FS is a single character, fields are separated by that character.  If FS is the null string, then each individual character becomes a separate field.  Otherwise, FS is expected to be a full regular expression."

Cheers, Jamie
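
The three cases in the gawk text quoted above can be sketched as follows. This is only an illustration: show_fields is a hypothetical helper, and the empty-FS case is the one that POSIX leaves unspecified, so it assumes an awk that splits per character:

```shell
# Print every field of the line on stdin wrapped in [] so empty fields are visible.
# show_fields is a hypothetical helper; its argument is the FS to use.
show_fields() {
    awk -v FS="$1" '{
        out = ""
        for (i = 1; i <= NF; i++)
            out = out "[" $i "]"
        print out
    }'
}

printf 'a|b||c\n' | show_fields '|'      # single character, literal: [a][b][][c]
printf 'abc\n'    | show_fields ''       # null string, one field per character: [a][b][c]
printf 'a|b||c\n' | show_fields '[|]+'   # longer string, full ERE: [a][b][c]
```

Note that "|" on its own is not a valid literal in an ERE, yet it works as a single-character FS.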
Comment 3 Wayne Sierke 2018-02-26 05:12:50 UTC
Ok, clearly I *was* missing something.

This probably warrants some further contemplation since the same exceptions apply to the "split()" function. Perhaps the pertinent references to "regular expression" should be removed and a separate paragraph or section added to discuss "field separators", rather than repeating the exceptions in multiple places.
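
A quick check that split() follows the same rules, as a sketch only. split_count is a hypothetical wrapper and the sample strings are made up:

```shell
# Return the number of pieces split() produces for a string and a separator.
# split_count is a hypothetical wrapper around awk's split().
split_count() {
    awk -v s="$1" -v sep="$2" 'BEGIN { print split(s, parts, sep) }'
}

split_count 'a.b.c' '.'       # single character, literal: 3
split_count 'a.b.c' '(.)'     # ERE matching any character: 6
split_count ' a b  c ' ' '    # single space, whitespace rules apply: 3
```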

Agreed it renders the awk(1) man page inaccurate and it is one of the hallmarks of FreeBSD that its man pages provide accurate reference information. In fact it has stirred a vague memory in me of having been caught out by this exception. Well spotted!
Comment 4 Guangyuan Yang freebsd_committer 2020-12-26 13:29:52 UTC
Returning to pool.
Comment 5 Warner Losh freebsd_committer 2021-07-20 02:12:49 UTC
This change isn't technically correct.

Space is special, and that's not documented.
What is done when FS is a single character is not well documented.

     An input line is normally made up of fields separated by whitespace, or
     by the extended regular expression FS as described below.  The fields are
     denoted $1, $2, ..., while $0 refers to the entire line.  If FS is null,
     the input line is split into one field per character.  However, this
     behavior is unspecified in the IEEE Std 1003.1 ("POSIX.1") standard.  If
     FS is a single space, then leading and trailing blank and newline
     characters are skipped.  Fields are delimited by one or more blank or
     newline characters.  A blank character is a space or a tab.  If FS is a
     single character, other than space, fields are delimited by each single
     occurrence of that character.  The FS variable defaults to a single
     space.

is what I have in my tree to fix this paragraph. I'll commit it shortly after
waiting a week for feedback.
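
The two single-character cases in the proposed paragraph can be illustrated like this. It is a sketch with made-up input, and fields is a hypothetical helper name:

```shell
# Print each field of the line on stdin wrapped in [] so empty fields show up.
# fields is a hypothetical helper; its argument is the FS to use.
fields() {
    awk -v FS="$1" '{
        out = ""
        for (i = 1; i <= NF; i++)
            out = out "[" $i "]"
        print out
    }'
}

# FS is a single space: leading and trailing blanks are skipped,
# and runs of spaces or tabs count as one delimiter.
printf ' a\t b \n' | fields ' '    # [a][b]

# FS is any other single character: every occurrence delimits a field,
# so leading, trailing and adjacent separators yield empty fields.
printf ',a,,b,\n' | fields ','     # [][a][][b][]
```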
Comment 6 commit-hook freebsd_committer 2021-07-20 04:34:37 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=b891aedcdd2d9a3e1530e45f6b785b768cccc466

commit b891aedcdd2d9a3e1530e45f6b785b768cccc466
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2021-07-20 02:10:22 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2021-07-20 04:33:26 +0000

    awk: Add more details to the FS variable

    The current description of the FS is true, but only part of the
    truth. Add information about single characters and note that FS="" is
    undefined by the standard, though the two other awk implementations (mawk
    and gawk) also have this interpretation.

    PR:             226112
    Sponsored by:   Netflix

 usr.bin/awk/awk.1 | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)
Comment 7 Jamie Landeg-Jones 2021-07-22 01:08:01 UTC
(In reply to Warner Losh from comment #5)

Thanks Warner. That looks perfect to me.

I had actually originally thought that setting FS to a single space made it literal (i.e. different from the default FS of leading/trailing-stripped-collapsed-whitespace) but now I realise you need to use '[ ]' to achieve that condition.
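
For reference, a small sketch of that difference. The input is made up and bracketed is a hypothetical helper:

```shell
# Print each field of the line on stdin wrapped in [] so empty fields are visible.
# bracketed is a hypothetical helper; its argument is the FS to use.
bracketed() {
    awk -v FS="$1" '{
        out = ""
        for (i = 1; i <= NF; i++)
            out = out "[" $i "]"
        print out
    }'
}

printf ' a  b\n' | bracketed ' '     # default-style whitespace splitting: [a][b]
printf ' a  b\n' | bracketed '[ ]'   # each space is a literal separator: [][a][][b]
```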

Cheers, Jamie
Comment 8 Jamie Landeg-Jones 2021-07-22 01:37:33 UTC
(In reply to Warner Losh from comment #5)

One thing: As Wayne pointed out in an earlier comment, these same rules and exceptions to FS also apply to the "split" function, although the "fs" parameter there is just described as a regular expression.