Bug 243229 - awk length() function in base system produces incorrect results for UTF-8 strings
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: 12.1-RELEASE
Hardware: amd64 Any
Importance: --- Affects Some People
Assignee: Warner Losh
Reported: 2020-01-09 21:20 UTC by Serge Volkov
Modified: 2021-07-20 02:15 UTC

Description Serge Volkov 2020-01-09 21:20:49 UTC
I tried using the length() function with UTF-8 strings, and it produces an incorrect result. The function counts bytes rather than characters, so for Cyrillic text, where each letter takes two bytes in UTF-8, the reported length is twice the number of characters.

Steps to reproduce (for LANG=ru_RU.UTF-8):

echo 'Привет' | awk '{print length($1)}'

If I use the function length() with lang/gawk, then UTF-8 string length is calculated correctly.
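
The byte/character discrepancy can be checked independently of any awk implementation. The following is a minimal sketch, assuming a POSIX shell with tr and wc; utf8_bytes and utf8_chars are hypothetical helper names, not part of awk:

```shell
# Hypothetical helpers (not part of awk): count bytes and code points
# in a UTF-8 string passed as the first argument.

utf8_bytes() {
    # wc -c counts bytes regardless of locale; $(( )) strips any padding
    # that some wc implementations add around the number.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

utf8_chars() {
    # Every UTF-8 code point contains exactly one non-continuation byte,
    # so delete continuation bytes (0x80-0xBF, octal 200-277) and count
    # the bytes that remain.
    echo $(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

utf8_bytes 'Привет'   # 12: each Cyrillic letter is two bytes in UTF-8
utf8_chars 'Привет'   # 6
```

A byte-oriented length() reports the first number; a character-oriented one, such as the lang/gawk behavior described above, reports the second.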

Are you planning to update awk in the base system to support UTF-8 strings in the near future?
Comment 1 Conrad Meyer freebsd_committer 2020-01-10 01:47:07 UTC
I'm not sure it makes sense for length() to count UTF-8 strings in Unicode code points. That said, POSIX awk is somewhat clear that you're correct:


> LC_CTYPE
> Determine the locale for the interpretation of sequences of bytes of text
> data as characters (for example, single-byte as opposed to multi-byte
> characters in arguments and input files), the behavior of character classes
> within regular expressions, the identification of characters as letters, and
> the mapping of uppercase and lowercase characters for the toupper and
> tolower functions.

However, the resulting behavior around indexing is nutty: this implies that the offsets used by index(), match(), etc., are measured in *characters*. To do this efficiently one probably has to convert non-ASCII strings to wchar_t and operate on those. As you can imagine, that would immensely slow down awk as a fast stream processing utility.
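
To illustrate the cost: without a wide-character copy of the string, turning a byte offset into a character offset means rescanning everything before the match. A rough sketch, assuming a POSIX shell with awk, tr, and wc, and input without backslashes (awk -v interprets escape sequences); byte_to_char_offset is a hypothetical helper, not an awk builtin:

```shell
# Hypothetical helper: convert a 1-based byte offset into a 1-based
# character offset within a UTF-8 string.
#   $1 = string, $2 = byte offset where a match starts
byte_to_char_offset() {
    # Take the bytes before the match (LC_ALL=C forces byte semantics in
    # substr), drop UTF-8 continuation bytes, and count what is left.
    # The cost is a full scan of the prefix for every conversion.
    echo $(( $(LC_ALL=C awk -v s="$1" -v n="$2" \
                 'BEGIN { printf "%s", substr(s, 1, n - 1) }' |
               LC_ALL=C tr -d '\200-\277' | wc -c) + 1 ))
}

# Byte offset of "мир" as a byte-oriented index() reports it.
LC_ALL=C awk 'BEGIN { print index("Привет мир", "мир") }'   # 14

# The same position counted in characters.
byte_to_char_offset 'Привет мир' 14                          # 8
```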

POSIX is more explicit about toupper() and tolower(), where taking locale into consideration is easier.

I guess I'm not clear on the value of a length() function that operates on code points rather than bytes.
Comment 2 Frédéric Fauberteau 2021-05-19 14:59:46 UTC
I don't know if this issue is related to that bug report, but the following command prints 'bin':
% echo "bin" | LANG=en_US awk '$1 ~ /^[\t -~]/ {print $0}'
while this one prints nothing:
% echo "bin" | LANG=en_US.UTF-8 awk '$1 ~ /^[\t -~]/ {print $0}'

The range from ' ' to '~' includes alphabetic characters when the locale is not UTF-8, but does not when the locale is en_US.UTF-8.

Note that '/^[\t -~]/' does match "bin" with C.UTF-8.
Comment 3 Warner Losh freebsd_committer 2021-07-20 02:15:15 UTC
https://reviews.freebsd.org/D31114 discusses the matter of locales.