From: Michael Klement
Subject: Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date: Fri, 29 Jan 2016 11:02:30 -0500
Thanks, Hermann. LC_ALL=C is an effective workaround for the case at hand, though it precludes working with Unicode characters as such in the rest of the script (which may never be needed). Another workaround, though not fully equivalent, is:
This works without LC_ALL=C, but excludes ALL non-ASCII characters, not just those in the range 128 - 255.

Which brings me to a question (couldn't figure it out from the docs): Are the \x.. escapes inside bracket expressions *supposed* to work with *all Unicode* codepoints? In other words: in a UTF-8 locale, *can you specify Unicode code-point ranges* (that go way beyond 0xFF) rather than just individual-byte ranges?

The following does appear to work in locale "en_US.UTF-8", but it may be accidental:

    # Exclude all non-ASCII chars (exclude the entire non-ASCII Unicode codepoint range).
    echo 'hät' | gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'

The crash prevents me from testing the complement. Obviously, without a construct to delimit the hex digits ({…} doesn't work), there's ambiguity.

Either way, I suggest clarifying the behavior at https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html#Bracket-Expressions

Michael
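
For illustration, the byte-level approach discussed in this thread can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact command from the thread: the helper name strip_high_bytes is hypothetical, and octal escapes are used instead of \x.. because octal escapes are the form POSIX awk specifies. Under LC_ALL=C the bracket range is matched against individual bytes rather than code points.

```shell
# Hypothetical helper: remove every byte in the range 0x80-0xFF from stdin.
# Assumption: LC_ALL=C makes awk treat the input as single bytes, so
# [\200-\377] is a byte range, not a Unicode code-point range.
strip_high_bytes() {
  LC_ALL=C awk '{ gsub(/[\200-\377]/, ""); print }'
}

# 'hät' encoded as UTF-8: \303\244 are the two bytes of U+00E4.
printf 'h\303\244t\n' | strip_high_bytes
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so this removes all non-ASCII characters, including those whose code points lie above U+00FF, and not only the characters in the range 128 - 255.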