Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
From: Hermann Peifer
Subject: Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date: Fri, 29 Jan 2016 20:42:09 +0100
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.5.1
On 2016-01-29 17:02, Michael Klement wrote:
> Thanks, Hermann.
>
> LC_ALL=C is an effective workaround for the case at hand, though it
> precludes working with Unicode characters as such in the rest of the
> script (which may never be needed).
>
> Another workaround, though not fully equivalent, is:
>
> echo 'hät' | gawk '{ gsub(/[^\x00-\x7F]/, ""); print }'
>
>
> This works without LC_ALL=C, but excludes ALL non-ASCII characters, not
> just those in the range 128 - 255.
>
> Which brings me to a question (couldn't figure it out from the docs):
>
> Are the \x.. escapes inside bracket expressions *supposed* to work with
> *all Unicode* codepoints?
> In other words: in a UTF-8 locale, *can you specify Unicode code-point
> ranges* (that go way beyond 0xFF) rather than just individual-byte ranges?
>
> The following does appear to work in locale "en_US.UTF-8", but it may be
> accidental:
>
> # Exclude all non-ASCII chars (exclude the entire non-ASCII Unicode
> codepoint range).
>
> echo 'hät' | gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
>
>
> The crash prevents me from testing the complement.
>
>
> Obviously, without a construct to delimit the hex digits ({…}
> doesn't work), there's ambiguity.
>
>
> Either way, I suggest clarifying the behavior at
> https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html#Bracket-Expressions
>
>
In my understanding, bracket expressions like [\x80-\xff] are about byte
ranges, not code point ranges. I might be wrong though and hope that
others on the list can help.
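To illustrate the byte-versus-code-point distinction: in UTF-8, 'ä' is a single code point (U+00E4) encoded as two bytes. What follows is a minimal sketch, assuming a POSIX shell with `od` and `tr`. The octal escapes are only a locale-independent way to write those two bytes, and the variable name `utf8_bytes` is made up for this illustration.

```shell
# 'ä' is one code point (U+00E4) but two bytes in UTF-8: 0xC3 0xA4.
# Octal escapes are used so the result does not depend on how this
# text itself is encoded.
utf8_bytes=$(printf '\303\244' | od -An -tx1 | tr -d ' \n')
echo "$utf8_bytes"
```

Under LC_ALL=C each of those two bytes is treated as its own character, which is why a byte range such as [\x80-\xff] can match them individually.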
About your last example: see below for what I am getting here.
Hermann
$ # gawk 4.1.3: *seems* to work, by accident?
$ echo 'hät' | /opt/local/bin/gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
ä
$ # gawk 4.1.3: why are those chars in the middle gone?
$ echo ÄÖÜŚŜŹŻŽÅØÆ | /opt/local/bin/gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
ÄÖÜÅØÆ
$ # gawk/master doesn't like the range in the first place
$ echo 'hät' | gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'
gawk: cmd. line:1: error: Invalid collation character: /[^�-f7ff]/
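
One possible reading of the second result, offered as an assumption rather than a confirmed diagnosis: the characters that survive (ÄÖÜÅØÆ) all have code points at or below U+00FF, while those that vanish (ŚŜŹŻŽ) lie above it. That would fit gawk 4.1.3 effectively treating the upper bound of the range as \xff. The sketch below checks only the code-point half of that observation, for one character from each group. `cp2` is a hypothetical helper, not part of gawk or any standard tool, that decodes a two-byte UTF-8 sequence with plain shell arithmetic.

```shell
# cp2 HEX1 HEX2: decode a two-byte UTF-8 sequence (given as two hex
# bytes) into its Unicode code point. Hypothetical helper for
# illustration only; it does no validation of its input.
cp2() {
  printf 'U+%04X\n' "$(( ((0x$1 & 0x1F) << 6) | (0x$2 & 0x3F) ))"
}

cp2 c3 84   # UTF-8 bytes of 'Ä', which survived the gsub()
cp2 c5 9a   # UTF-8 bytes of 'Ś', which was removed
```

The first call prints U+00C4 and the second U+015A, so the dividing line between kept and removed characters falls on the 0xFF boundary.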