Bug 226112 - awk(1) man page unclear about field separator, FS
Summary: awk(1) man page unclear about field separator, FS
Status: Open
Alias: None
Product: Documentation
Classification: Unclassified
Component: Manual Pages
Version: Latest
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Guangyuan Yang
Keywords: patch
Blocks: 230730
Reported: 2018-02-22 10:48 UTC by Jamie Landeg-Jones
Modified: 2018-11-27 08:53 UTC (History)

Attachments
patch for usr.bin/awk/awk.1 to clarify the use of FS (990 bytes, text/plain)
2018-02-22 10:48 UTC, Jamie Landeg-Jones

Description Jamie Landeg-Jones 2018-02-22 10:48:54 UTC
Created attachment 190888
patch for usr.bin/awk/awk.1 to clarify the use of FS

I realise that awk is a contributed package, but the man page for it has been pulled in from OpenBSD, and has since been locally modified: https://svnweb.freebsd.org/base/head/usr.bin/awk/awk.1?view=log

Anyway, the man page assumes that, apart from the exceptions it mentions, FS is a regular expression. This is not the case for a single character: a single-character FS is treated as a literal.

Attached patch clarifies this. Note that the man page for gawk is correct on the subject.

Cheers
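
The behaviour described above can be checked with a minimal sketch like the following. The helper name count_fields is made up for this illustration and is not part of awk or of the attached patch. It only shows that a lone "." in FS splits on literal dots rather than on "any character".

```shell
#!/bin/sh
# count_fields is a hypothetical helper for this sketch: print how many
# fields awk sees in TEXT when FS is used as the field separator.
count_fields() {  # usage: count_fields FS TEXT
    printf '%s\n' "$2" | awk -v FS="$1" '{ print NF }'
}

# A single "." is taken literally, so the line splits at its two dots.
count_fields '.' 'a.b.c'    # prints 3

# With no literal dot in the input, nothing is split at all. Had "." been
# read as a regular expression, every character would have been a separator.
count_fields '.' 'abc'      # prints 1
```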
Comment 1 Wayne Sierke 2018-02-24 05:08:03 UTC
An RE can be a literal of one or more characters. From re_format(7):

  A branch is one‡ or more pieces, concatenated.
  A piece is an atom possibly followed by a single‡ ‘*’, ‘+’, ‘?’, or
     bound.
  An atom is ... , or a single character ...


Consider:

  % echo foobar | awk -F o -v OFS=X '{$1=$1;print}'
  fXXbar
  % echo foobar | awk -F oo -v OFS=X '{$1=$1;print}'
  fXbar
  %


I don't think there is a need to separately reference single-character literal REs.

(And unless I'm missing something I think the gawk man page is misguided for doing so.)
Comment 2 Jamie Landeg-Jones 2018-02-24 20:41:03 UTC
Hi. Thanks for the reply.

I'll explain how I got here:

I wanted to do a quick hack to split a line at every character, and at that point I was not familiar with "awk" allowing a null string as FS to do the job.

Hence, believing the string was a regular expression, I set FS to "." which - contrary to the manual - was taken as a literal, not an RE!

Indeed, in your description of the atom from re_format, you missed out:

"or a single character with no other significance (matching that character)."

As for your examples: you used cases where the single character is already a literal in an RE, which isn't always so:

% printf 'hello(world' | egrep '('
egrep: Unmatched ( or \(
% printf 'hello(world' | awk -F '(' '{print $1}'
hello

I know this is hardly a major error, but it is still inaccurate - especially in the case of "."!

Just for info, the actual text from gawk (which probably phrases it better than I did!) is:

"If FS is a single character, fields are separated by that character.  If FS is the null string, then each individual character becomes a separate field.  Otherwise, FS is expected to be a full regular expression."

Cheers, Jamie
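
The three cases in the gawk text quoted above can be sketched as follows. The helper names nth_field and char_fields are hypothetical, chosen only for this example. The null-string case is an extension that POSIX does not require, so treat it as an assumption about the awk in use.

```shell
#!/bin/sh
# nth_field is a hypothetical helper for this sketch: print field N of TEXT
# after awk has split TEXT on the separator FS.
nth_field() {  # usage: nth_field FS N TEXT
    printf '%s\n' "$3" | awk -F "$1" -v n="$2" '{ print $n }'
}

# char_fields is likewise hypothetical: count the fields when FS is the
# null string (assumed to be supported by the awk in use).
char_fields() {  # usage: char_fields TEXT
    printf '%s\n' "$1" | awk 'BEGIN { FS = "" } { print NF }'
}

# Case 1: a single-character FS is used literally, even when the character
# is an ERE metacharacter such as "(".
nth_field '(' 2 'hello(world'    # prints world

# Case 2: a null-string FS makes each character its own field.
char_fields 'abc'                # prints 3

# Case 3: anything longer is a full regular expression, so "o+" treats the
# whole run of o's as one separator.
nth_field 'o+' 2 'foooobar'      # prints bar
```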
Comment 3 Wayne Sierke 2018-02-26 05:12:50 UTC
Ok, clearly I *was* missing something.

This probably warrants some further contemplation since the same exceptions apply to the "split()" function. Perhaps the pertinent references to "regular expression" should be removed and a separate paragraph or section added to discuss "field separators", rather than repeating the exceptions in multiple places.

Agreed, it renders the awk(1) man page inaccurate, and it is one of the hallmarks of FreeBSD that its man pages provide accurate reference information. In fact it has stirred a vague memory in me of having been caught out by this exception. Well spotted!
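
As a rough illustration of the point that split() follows the same rules, here is a small sketch. The helper name split_count is invented for this example. It passes the separator to split() as a string, which is the case the exceptions apply to.

```shell
#!/bin/sh
# split_count is a hypothetical helper for this sketch: print the number of
# pieces split() produces when string S is split on the string separator SEP.
split_count() {  # usage: split_count S SEP
    awk -v s="$1" -v sep="$2" 'BEGIN { print split(s, parts, sep) }'
}

# A one-character separator is literal here too: "." splits at the dots...
split_count 'a.b.c' '.'       # prints 3

# ...and does not act as "any character" when no dot is present.
split_count 'abc' '.'         # prints 1

# A longer separator is treated as a regular expression.
split_count 'foooobar' 'o+'   # prints 2
```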