Bug 241441 - inconsistency between allowed empty regex for `awk -F` and split()
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 12.0-STABLE
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Warner Losh
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-23 19:20 UTC by Tim Chase
Modified: 2021-07-31 05:51 UTC
Description Tim Chase 2019-10-23 19:20:58 UTC
I get an error when I try to use an empty regex for the field separator:

  $ echo hello | awk -F '' '{print $2}'
  awk: field separator FS is empty

but awk has no issues splitting things on an empty regex:

  $ awk 'BEGIN{s="hello"; split(s, a, ""); print a[1]}'
  h

Over on gawk, I get the expected behavior:

  $ echo hello | awk -F '' '{print $1}'
  h

This is somewhat similar to #226112

  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=226112

I get that awk uses EREs and `man re_format` says that "A (modern [Extended]) RE is one or more non-empty branches, separated by '|'", but:

1) that's not what split() does

2) it's not what gawk's -F parameter does

3) permitting an empty regex for splitting already seems supported in awk code (as the split example shows) and shouldn't break any existing usage

4) as a non-workaround, `man re_format` says that the atom "()" matches the null string, but

  $ echo hello | awk -F '()' '{print $1}'

doesn't split the row on the null regular expression (FWIW, gawk gives the same results when using "()" as the split pattern).

In an ideal world, the behavior would match the behavior of gawk & the split() function, splitting the record into each individual character.
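
For reference, a minimal sketch of the per-character split described above, done with split() and an empty separator instead of -F ''. The variable name `chars` is only for illustration, and this relies on the awk in use splitting on an empty separator the way the split() example earlier in this report does.

```shell
# Split each input line into its characters with split() and an empty
# separator, then print the pieces separated by single spaces.
# Assumption: this awk treats an empty split() separator as
# "one element per character", as in the split() example in this report.
chars=$(echo hello | awk '{
    n = split($0, a, "")
    for (i = 1; i <= n; i++)
        printf "%s%s", a[i], (i < n ? " " : "\n")
}')
echo "$chars"
```

This avoids setting FS at all, so it sidesteps the "field separator FS is empty" error, at the cost of doing the split by hand for every record.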
Comment 1 Warner Losh freebsd_committer freebsd_triage 2021-07-20 03:47:22 UTC
The standard states that FS='' is undefined behavior.
It also states that -F sepstring and -v FS=sepstring are identical.

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

However, one true awk treats them differently.
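
A rough way to check that equivalence from the shell is sketched below. The helper name `same_fs` and its arguments are hypothetical, made up for this illustration; it simply runs the same one-line program with the separator passed both ways and compares what comes back (stderr included, so an error on one side shows up as a difference).

```shell
# Hypothetical helper: print "same" if field N of the input comes out
# identically with -F sep and with -v FS=sep, otherwise "different".
same_fs() {
    sep=$1
    input=$2
    field=$3
    # Capture stderr too, so an error message from one form counts as a mismatch.
    via_F=$(printf '%s\n' "$input" | awk -F "$sep" -v n="$field" '{ print $n }' 2>&1)
    via_v=$(printf '%s\n' "$input" | awk -v FS="$sep" -v n="$field" '{ print $n }' 2>&1)
    if [ "$via_F" = "$via_v" ]; then
        echo same
    else
        echo different
    fi
}

# A non-empty separator is well defined, so both forms should agree.
result=$(same_fs ':' 'a:b:c' 2)
echo "$result"
```

Calling `same_fs '' hello 2` exercises the case this PR is about. On an awk that rejects -F '' with "field separator FS is empty", I would expect it to report "different".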
Comment 2 Warner Losh freebsd_committer freebsd_triage 2021-07-20 03:59:00 UTC
I've filed the following

https://github.com/onetrueawk/awk/issues/127

upstream. This seems inconsistent, especially since FS="" has well-documented behavior in awk(1) from upstream.
Comment 3 commit-hook freebsd_committer freebsd_triage 2021-07-24 15:09:43 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=a2e3e1187309f9404940b61ca49a93bd0536559d

commit a2e3e1187309f9404940b61ca49a93bd0536559d
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2021-07-20 04:47:30 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2021-07-24 15:08:16 +0000

    awk: Make -F '' and -v FS="" behave the same

    IEEE Std 1003.1-2008 mandates that -F str be treated the same as -v
    FS=str. For a null string, this was not the case. Since awk(1) documents
    that a null string for FS has a specific behavior, make -F '' behave
    consistently with -v FS="".

    PR:                     241441
    Upstream issue:         https://github.com/onetrueawk/awk/issues/127
    Upstream pull request:  https://github.com/onetrueawk/awk/pull/128
    MFC After:              2 weeks
    Sponsored by:           Netflix

 contrib/one-true-awk/main.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
Comment 4 commit-hook freebsd_committer freebsd_triage 2021-07-30 23:35:04 UTC
A commit in branch stable/13 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=f4ed53c6f5254edcc28c34cbe67d698bd93cb05e

commit f4ed53c6f5254edcc28c34cbe67d698bd93cb05e
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2021-07-20 04:47:30 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2021-07-30 23:02:13 +0000

    awk: Make -F '' and -v FS="" behave the same

    IEEE Std 1003.1-2008 mandates that -F str be treated the same as -v
    FS=str. For a null string, this was not the case. Since awk(1) documents
    that a null string for FS has a specific behavior, make -F '' behave
    consistently with -v FS="".

    PR:                     241441
    Upstream issue:         https://github.com/onetrueawk/awk/issues/127
    Upstream pull request:  https://github.com/onetrueawk/awk/pull/128
    MFC After:              2 weeks
    Sponsored by:           Netflix

    (cherry picked from commit a2e3e1187309f9404940b61ca49a93bd0536559d)

 contrib/one-true-awk/main.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
Comment 5 commit-hook freebsd_committer freebsd_triage 2021-07-31 00:22:17 UTC
A commit in branch stable/12 references this bug:

URL: https://cgit.FreeBSD.org/src/commit/?id=ab1dedd4946098fe7202e825d299a2cbec81dae0

commit ab1dedd4946098fe7202e825d299a2cbec81dae0
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2021-07-20 04:47:30 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2021-07-31 00:02:51 +0000

    awk: Make -F '' and -v FS="" behave the same

    IEEE Std 1003.1-2008 mandates that -F str be treated the same as -v
    FS=str. For a null string, this was not the case. Since awk(1) documents
    that a null string for FS has a specific behavior, make -F '' behave
    consistently with -v FS="".

    PR:                     241441
    Upstream issue:         https://github.com/onetrueawk/awk/issues/127
    Upstream pull request:  https://github.com/onetrueawk/awk/pull/128
    MFC After:              2 weeks
    Sponsored by:           Netflix

    (cherry picked from commit a2e3e1187309f9404940b61ca49a93bd0536559d)

 contrib/one-true-awk/main.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)