Bug 243229 - awk length() function in base system produces incorrect results for UTF-8 strings
Status: New
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: freebsd-bugs mailing list
Reported: 2020-01-09 21:20 UTC by Serge Volkov
Modified: 2020-03-10 17:53 UTC

Description Serge Volkov 2020-01-09 21:20:49 UTC
I tried using the length() function with UTF-8 strings, and it produces an incorrect result. The function counts bytes rather than characters, so for Cyrillic text, where each character takes two bytes in UTF-8, the reported length is twice the number of characters.

Steps to reproduce (for LANG=ru_RU.UTF-8):

echo 'Привет' | awk '{print length($1)}'

If I use the function length() with lang/gawk, then UTF-8 string length is calculated correctly.
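
To make the difference concrete, here is a minimal sketch of the two possible meanings of "length" for this string. It uses Python purely for illustration; the helper names byte_length and codepoint_length are hypothetical and are not part of awk or gawk.

```python
# Illustration only: the two candidate meanings of length() for a UTF-8 string.
# byte_length and codepoint_length are hypothetical helper names.

def byte_length(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

def codepoint_length(s: str) -> int:
    """Number of Unicode code points in s."""
    return len(s)

text = "Привет"
print(byte_length(text))       # 12: each Cyrillic letter is two bytes in UTF-8
print(codepoint_length(text))  # 6: one code point per letter
```

The first number is what a byte-oriented length() reports; the second is what a character-oriented one reports.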

Are you planning to update awk in the base system to support UTF-8 strings in the near future?
Comment 1 Conrad Meyer freebsd_committer 2020-01-10 01:47:07 UTC
I'm not sure it makes sense to compute length() on UTF-8 strings in Unicode code points.  POSIX awk is somewhat clear that you're correct:


> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.

However, the resulting behavior around indexing is nutty: this implies that index(), match(), etc., are measured in *characters*.  To do this efficiently one probably has to convert non-ASCII strings to wchar_t and operate on those.  As you can imagine, that would immensely slow down awk as a fast stream processing utility.
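
As a rough illustration of the indexing question, the sketch below computes the position of a substring both ways. It is written in Python for brevity and is not how any awk implementation works; byte_index and char_index are hypothetical names, and the 1-based result with 0 for "not found" is chosen to mirror awk's index() convention.

```python
# Illustration only: byte-based vs. character-based substring positions.
# byte_index and char_index are hypothetical helper names.

def byte_index(haystack: str, needle: str) -> int:
    """1-based byte offset of needle in the UTF-8 encoding, 0 if absent."""
    return haystack.encode("utf-8").find(needle.encode("utf-8")) + 1

def char_index(haystack: str, needle: str) -> int:
    """1-based code point offset of needle, 0 if absent."""
    return haystack.find(needle) + 1

s = "Привет, awk"
print(byte_index(s, "awk"))  # 15: 12 bytes of Cyrillic, then ", " (2 bytes)
print(char_index(s, "awk"))  # 9: 6 letters, then ", " (2 characters)
```

The two results diverge as soon as any multi-byte character precedes the match, which is why a character-based length() would pull index(), match(), and substr() along with it.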

POSIX is more explicit about toupper() and tolower(), where taking locale into consideration is easier.

I guess I'm not clear on what value there is in a length() function that operates on code points rather than bytes.