Bug 172862 - sed(1) improperly deals with escape chars
Summary: sed(1) improperly deals with escape chars
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin (show other bugs)
Version: 9.1-STABLE
Hardware: Any Any
: Normal Affects Only Me
Assignee: Kyle Evans
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-10-18 22:10 UTC by Enji Cooper
Modified: 2020-06-07 20:52 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Enji Cooper freebsd_committer freebsd_triage 2012-10-18 22:10:00 UTC
sed doesn't appear to be doing the right thing with escape chars (in this case '\t'); it's not properly reinterpreting '\t' as \011, but is instead interpreting it was 't':

$ echo "foot " | sed -e 's/[\\t ]*$//' | hexdump -C
00000000  66 6f 6f 0a                                       |foo.|
00000004
$ echo "foot " | sed -E -e 's/[\\t ]*$//' | hexdump -C
00000000  66 6f 6f 0a                                       |foo.|
00000004

GNU sed does do the right thing with escape chars (verified on Fedora 17):

# cat /etc/redhat-release
Fedora release 17 (Beefy Miracle)
# echo foot | sed -e 's/[\t ]*$//' | hexdump -C
00000000  66 6f 6f 74 0a                                    |foot.|
00000005
# echo "foot " | sed -e 's/[\t ]*$//' | hexdump -C
00000000  66 6f 6f 74 0a                                    |foot.|
00000005

How-To-Repeat: echo foot | sed -e 's/[\t ]*$//'
Comment 1 johnandsara2 2012-10-21 01:37:25 UTC
Hi.  This should be files as a Question not a Bug.  You need to investigate things before claiming 
"bug".


 > $ echo "foot " | sed -e 's/[\\t ]*$//' | hexdump -C

What I see is your using [] wrong.  '\t' doesn't go inside it for many WELL documented reasons.

Furthermore sed mathes '\t' without [] so why bother filing a bug against [] ?

'\t' has been around more years than your were born and probably sed too...

anyhow.  "ediquitte" is you ask questions to users and file bugs about only if your very as 
professor sure and also know how to fix it (pref. wtih code patch filed which absolutely CANNOT 
cause past software to stop working).


p.s.

So, when you see a "majorly relied upon" unix tool doing "something stupid"
it's very likely you have more learning and asking to do.

1) you don't want people hacking sed to make it work your way it would break
    past software

2) you don't want BSD-SED to have false bugs filed it will make BSD look worse
    than it is

3)  since early stable versions of free bsd, bsd has accumulated
    way too many bugs and fixes and hardware compat changes - about
    a suspicious ammount - and still isn't as stable as it had been )
Comment 2 Enji Cooper freebsd_committer freebsd_triage 2012-10-21 02:23:26 UTC
On Sat, Oct 20, 2012 at 5:37 PM, John D. Hendrickson and Sara Darnell
<johnandsara2@cox.net> wrote:

...

> What I see is your using [] wrong.  '\t' doesn't go inside it for many WELL
> documented reasons.

Cite documentation please and explain where I got it wrong.

> Furthermore sed mathes '\t' without [] so why bother filing a bug against []
> ?

The point was to match one or the other as this was something that
someone else wrote that's a valid regular expression in bracket
notation that I executed on both FreeBSD and Linux. '\040' != '\011'
and as such a space shouldn't be a drop in replacement for a
horizontal tab. The PR was filed to note the discrepancy.

Thanks,
-Garrett
Comment 3 Yuri Pankov 2017-12-04 05:41:38 UTC
Despite what comment #1 says, sed does NOT match literal "\t" to a tab character outside [] as well -- quoting http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html:

9.3.2 BRE Ordinary Characters

The interpretation of an ordinary character preceded by an unescaped <backslash> ('\\') is undefined.

9.4.2 ERE Ordinary Characters

The interpretation of an ordinary character preceded by an unescaped <backslash> ( '\\' ) is undefined.

There's an exception that is important here that <backslash> loses it's special meaning inside the bracket expression, so in your examples the bracket expression "[\t ]" correctly matches the 't' character and whitespace.

Given the above, GNU sed actually violates the standard which defines the 's' command as the following:

[2addr]s/BRE/replacement/flags

...so all BRE (ERE with -E) rules apply here.

I would agree that *outside* of the bracket expression we could make "\t" match the tab character (that's what libtre does apparently) as it's more readable than inserting literal tab characters in RE, but inside the bracket expression GNU sed is clearly wrong, and our sed (through regex(3)) is doing the right thing.

(I just hope I'm understanding everything correctly here, of course)
Comment 4 Eitan Adler freebsd_committer freebsd_triage 2018-05-20 23:54:17 UTC
For bugs matching the following conditions:
- Status == In Progress
- Assignee == "bugs@FreeBSD.org"
- Last Modified Year <= 2017

Do
- Set Status to "Open"
Comment 5 Kyle Evans freebsd_committer freebsd_triage 2020-06-07 20:52:28 UTC
\t (and \r and \n) was properly implemented in r355590 and MFC'd to stable/12 in r355902.

(In reply to Yuri Pankov from comment #3)

I tend to agree that it's a standards violation based on my re-reading of the spec, but I would propose that it's an OK violation in the name of reasonable expectations and not making users' lives harder. \t => t in a bracketed expression serves little value and doesn't accomplish anything you can't do otherwise, but being able to [ \t] to capture the subset of [:space:] that might represent leading indentation you want to normalize.