branch-1_4 8-bit clean translit
From: Eric Blake
Subject: branch-1_4 8-bit clean translit
Date: Sat, 11 Nov 2006 05:54:26 -0700
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.8) Gecko/20061025 Thunderbird/1.5.0.8 Mnenhy/0.7.4.666
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
$ cat <<\EOF | m4
translit(`«abc~', `~-»')
EOF
«
Oops - ranges that extend across the 0x7f-0x80 boundary misbehave on
machines where char is signed. Also, our testsuite assumes ASCII in the
translit tests, but so far no one has reported failures when porting to
EBCDIC platforms (where the range A-Z covers more than just the 26
letters), so I doubt it is worth worrying about.
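For anyone wondering why the signedness matters, here is a minimal
standalone sketch of range expansion with unsigned endpoints. It is not
the code in src/builtin.c: expand_ranges_sketch and its caller-supplied
buffer are made up for illustration (the real function grows an
obstack), and to_uchar here is an assumed one-line stand-in for the
helper the patch calls. With a signed char, `~' (0x7e) compares greater
than a byte such as 0xbb, which is negative, so the range gets walked
backwards instead of forwards.

```c
#include <stddef.h>

/* Assumed stand-in for the to_uchar helper named in the patch:
   reinterpret a possibly-signed char as a value in 0..255.  */
static unsigned char
to_uchar (char ch)
{
  return (unsigned char) ch;
}

/* Illustrative sketch only.  Expand the ranges in S into OUT (not
   NUL-terminated) and return the number of bytes written.  OUT must be
   large enough; each range adds at most 255 bytes.  */
static size_t
expand_ranges_sketch (const char *s, unsigned char *out)
{
  size_t n = 0;
  unsigned char from;
  unsigned char to;

  for (from = '\0'; *s != '\0'; from = to_uchar (*s++))
    {
      if (*s == '-' && from != '\0')
        {
          to = to_uchar (*++s);
          if (to == '\0')
            {
              /* Trailing dash: emit it literally and stop.  */
              out[n++] = '-';
              break;
            }
          else if (from <= to)
            {
              /* Ascending range; FROM itself was already emitted.  */
              while (from++ < to)
                out[n++] = from;
            }
          else
            {
              /* Descending range, e.g. 9-0.  */
              while (--from >= to)
                out[n++] = from;
            }
        }
      else
        out[n++] = to_uchar (*s);
    }
  return n;
}
```

Called on "~-\xbb", the sketch produces the 62 bytes 0x7e through 0xbb in
ascending order, and "9-0" still produces "9876543210".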
2006-11-11 Eric Blake <address@hidden>
* src/builtin.c: Remove unnecessary casts.
(expand_ranges): Make 8-bit clean.
* doc/m4.texinfo (Translit): Add tests and wording.
* NEWS: Document this fix.
- --
Life is short - so eat dessert first!
Eric Blake address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFFVcgC84KuGfSFAYARAqwsAKC16u8NpG48in0OOMQslWt66JxO9QCguVHT
wpE2vBw5R1xMMN431yt6WE4=
=fE0O
-----END PGP SIGNATURE-----
Index: NEWS
===================================================================
RCS file: /sources/m4/m4/NEWS,v
retrieving revision 1.1.1.1.2.79
diff -u -p -r1.1.1.1.2.79 NEWS
--- NEWS 1 Nov 2006 13:44:53 -0000 1.1.1.1.2.79
+++ NEWS 11 Nov 2006 12:48:54 -0000
@@ -43,7 +43,8 @@ Version 1.4.8 - ?? ??? 2006, by ?? (CVS
* The `changecom' and `changequote' macros now treat an empty second
argument the same as if it were missing, rather than using the empty
string and making it impossible to end a comment or quote.
-* The `translit' macro now operates in linear instead of quadratic time.
+* The `translit' macro now operates in linear instead of quadratic time,
+ and is now eight-bit clean.
* The `-D', `-U', `-s', and `-t' command line options now take effect
after any files encountered earlier on the command line, rather than up
front, as is done in traditional implementations and required by POSIX.
Index: doc/m4.texinfo
===================================================================
RCS file: /sources/m4/m4/doc/m4.texinfo,v
retrieving revision 1.1.1.1.2.99
diff -u -p -r1.1.1.1.2.99 m4.texinfo
--- doc/m4.texinfo 8 Nov 2006 05:08:26 -0000 1.1.1.1.2.99
+++ doc/m4.texinfo 11 Nov 2006 12:48:56 -0000
@@ -2828,9 +2828,9 @@ foo
The quotation strings can safely contain eight-bit characters.
@ignore
-Yuck. I know of no clean way to render an 8-bit character in both info
-and dvi. This example uses the `open-guillemot' and `close-guillemot'
-characters of the Latin-1 character set.
+@c Yuck. I know of no clean way to render an 8-bit character in
+@c both info and dvi. This example uses the `open-guillemot' and
+@c `close-guillemot' characters of the Latin-1 character set.
@example
define(`a', `b')
@@ -3058,9 +3058,9 @@ changecom(`#', `')
The comment strings can safely contain eight-bit characters.
@ignore
-Yuck. I know of no clean way to render an 8-bit character in both info
-and dvi. This example uses the `open-guillemot' and `close-guillemot'
-characters of the Latin-1 character set.
+@c Yuck. I know of no clean way to render an 8-bit character in
+@c both info and dvi. This example uses the `open-guillemot' and
+@c `close-guillemot' characters of the Latin-1 character set.
@example
define(`a', `b')
@@ -4134,14 +4134,15 @@ translation pass is made, even if charac
appear in @var{chars}.
As a @acronym{GNU} extension, both @var{chars} and @var{replacement} can
-contain character-ranges,
-e.g., @samp{a-z} (meaning all lowercase letters) or @samp{0-9} (meaning
-all digits). To include a dash @samp{-} in @var{chars} or
-@var{replacement}, place it first or last.
-
-It is not an error for the last character in the range to be `larger'
-than the first. In that case, the range runs backwards, i.e.,
-@samp{9-0} means the string @samp{9876543210}.
+contain character-ranges, e.g., @samp{a-z} (meaning all lowercase
+letters) or @samp{0-9} (meaning all digits). To include a dash @samp{-}
+in @var{chars} or @var{replacement}, place it first or last in the
+entire string, or as the last character of a range. Back-to-back ranges
+can share a common endpoint. It is not an error for the last character
+in the range to be `larger' than the first. In that case, the range
+runs backwards, i.e., @samp{9-0} means the string @samp{9876543210}.
+The expansion of a range is dependent on the underlying encoding of
+characters, so using ranges is not always portable between machines.
The macro @code{translit} is recognized only with parameters.
@end deffn
@@ -4153,17 +4154,31 @@ translit(`GNUs not Unix', `a-z', `A-Z')
@result{}GNUS NOT UNIX
translit(`GNUs not Unix', `A-Z', `z-a')
@result{}tmfs not fnix
+translit(`+,-12345', `+--1-5', `<;>a-c-a')
+@result{}<;>abcba
translit(`abcdef', `aabdef', `bcged')
@result{}bgced
@end example
-The first example deletes all uppercase letters, the second converts
-lowercase to uppercase, and the third `mirrors' all uppercase letters,
-while converting them to lowercase. The two first cases are by far the
-most common. The final example shows that @samp{a} is mapped to
-@samp{b}, not @samp{c}; the resulting @samp{b} is not further remapped
-to @samp{g}; the @samp{d} and @samp{e} are swapped, and the @samp{f} is
-discarded.
+In the @sc{ascii} encoding, the first example deletes all uppercase
+letters, the second converts lowercase to uppercase, and the third
+`mirrors' all uppercase letters, while converting them to lowercase.
+The two first cases are by far the most common, even though they are not
+portable to @sc{ebcdic} or other encodings. The fourth example shows a
+range ending in @samp{-}, as well as back-to-back ranges. The final
+example shows that @samp{a} is mapped to @samp{b}, not @samp{c}; the
+resulting @samp{b} is not further remapped to @samp{g}; the @samp{d} and
+@samp{e} are swapped, and the @samp{f} is discarded.
+
+@ignore
+@c No need to fight 8-bit characters, as it is difficult to get
+@c rendering right in both info and dvi.
+
+@example
+translit(`«abc~', `~-»')
+@result{}abc
+@end example
+@end ignore
Omitting @var{chars} evokes a warning, but still produces output.
Index: src/builtin.c
===================================================================
RCS file: /sources/m4/m4/src/Attic/builtin.c,v
retrieving revision 1.1.1.1.2.50
diff -u -p -r1.1.1.1.2.50 builtin.c
--- src/builtin.c 1 Nov 2006 22:29:08 -0000 1.1.1.1.2.50
+++ src/builtin.c 11 Nov 2006 12:48:56 -0000
@@ -359,12 +359,12 @@ numeric_arg (token_data *macro, const ch
static char const digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
static const char *
-ntoa (register eval_t value, int radix)
+ntoa (eval_t value, int radix)
{
bool negative;
unsigned_eval_t uvalue;
static char str[256];
- register char *s = &str[sizeof str];
+ char *s = &str[sizeof str];
*--s = '\0';
@@ -667,9 +667,9 @@ m4_dumpdef (struct obstack *obs, int arg
/* Make table of symbols invisible to expand_macro (). */
- (void) obstack_finish (obs);
+ obstack_finish (obs);
- qsort ((char *) data.base, data.size, sizeof (symbol *), dumpdef_cmp);
+ qsort (data.base, data.size, sizeof (symbol *), dumpdef_cmp);
for (; data.size > 0; --data.size, data.base++)
{
@@ -1645,14 +1645,14 @@ m4_substr (struct obstack *obs, int argc
static const char *
expand_ranges (const char *s, struct obstack *obs)
{
- char from;
- char to;
+ unsigned char from;
+ unsigned char to;
- for (from = '\0'; *s != '\0'; from = *s++)
+ for (from = '\0'; *s != '\0'; from = to_uchar (*s++))
{
if (*s == '-' && from != '\0')
{
- to = *++s;
+ to = to_uchar (*++s);
if (to == '\0')
{
/* trailing dash */
@@ -1772,7 +1772,7 @@ static void
substitute (struct obstack *obs, const char *victim, const char *repl,
struct re_registers *regs)
{
- register unsigned int ch;
+ int ch;
for (;;)
{
@@ -2031,7 +2031,7 @@ void
expand_user_macro (struct obstack *obs, symbol *sym,
int argc, token_data **argv)
{
- register const char *text;
+ const char *text;
int i;
for (text = SYMBOL_TEXT (sym); *text != '\0';)