[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
translit speedup
From: |
Eric Blake |
Subject: |
translit speedup |
Date: |
Tue, 31 Oct 2006 07:05:20 -0700 |
User-agent: |
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Thunderbird/1.5.0.7 Mnenhy/0.7.4.666 |
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I noticed that translit was quadratic - consider translit(a<5000 b's>,
<5000 a's>b, c<5000 d's>) - it was doing more than 25 million comparisons
to result in c<5000 d's>. With this patch, that is reduced to several
thousand comparisons. It also fixes index to use strstr - partly because
our implementation was not as efficient as gnulib's implementation for
single-byte locales (but the improvement is not as drastic as translit's),
and partly because it is one step closer to proper locale support (but
right now, supporting locales is a post 2.0 goal for me).
branch:
2006-10-31 Eric Blake <address@hidden>
* m4/gnulib-cache.m4: Augment with 'gnulib-tool --import strstr'.
* doc/m4.texinfo (Translit): Improve the documentation.
* src/builtin.c (m4_translit): Optimize to O(n) instead of O(n^2)
algorithm.
(m4_index): Simplify, and speed up slightly.
* NEWS: Document this fix.
head:
2006-10-31 Eric Blake <address@hidden>
* ltdl/m4/gnulib-cache.m4: Augment with 'gnulib-tool --import
strstr'.
* doc/m4.texinfo (Translit): Merge from branch.
* modules/m4.c (divert, substr): Ignore excess arguments.
(index, translit): Merge from branch.
* tests/builtins.at (translit): Add a test.
- --
Life is short - so eat dessert first!
Eric Blake address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFFR1gf84KuGfSFAYARAiJRAJ92C6bnQh7QPmn2FodmrXtionErGQCg0mp6
+0OLVh7CZT5l+vIESpYfnAQ=
=W8sy
-----END PGP SIGNATURE-----
Index: NEWS
===================================================================
RCS file: /sources/m4/m4/NEWS,v
retrieving revision 1.1.1.1.2.77
diff -u -p -r1.1.1.1.2.77 NEWS
--- NEWS 29 Oct 2006 15:22:42 -0000 1.1.1.1.2.77
+++ NEWS 31 Oct 2006 13:18:16 -0000
@@ -43,6 +43,7 @@ Version 1.4.8 - ?? ??? 2006, by ?? (CVS
* The `changecom' and `changequote' macros now treat an empty second
argument the same as if it were missing, rather than using the empty
string and making it impossible to end a comment or quote.
+* The `translit' macro now operates in linear instead of quadratic time.
Version 1.4.7 - 25 September 2006, by Eric Blake (CVS version 1.4.6a)
Index: doc/m4.texinfo
===================================================================
RCS file: /sources/m4/m4/doc/m4.texinfo,v
retrieving revision 1.1.1.1.2.95
diff -u -p -r1.1.1.1.2.95 m4.texinfo
--- doc/m4.texinfo 29 Oct 2006 15:22:42 -0000 1.1.1.1.2.95
+++ doc/m4.texinfo 31 Oct 2006 13:18:19 -0000
@@ -4080,9 +4080,14 @@ Expands to @var{string}, with each chara
the same index.
If @var{replacement} is shorter than @var{chars}, the excess characters
-are deleted from the expansion. If @var{replacement} is omitted, all
-characters in @var{string} that are present in @var{chars} are deleted
-from the expansion.
+of @var{chars} are deleted from the expansion; if @var{chars} is
+shorter, the excess characters in @var{replacement} are silently
+ignored. If @var{replacement} is omitted, all characters in
address@hidden that are present in @var{chars} are deleted from the
+expansion. If a character appears more than once in @var{chars}, only
+the first instance is used in making the translation. Only a single
+translation pass is made, even if characters in @var{replacement} also
+appear in @var{chars}.
As a @acronym{GNU} extension, both @var{chars} and @var{replacement} can
contain character-ranges,
@@ -4104,12 +4109,17 @@ translit(`GNUs not Unix', `a-z', `A-Z')
@result{}GNUS NOT UNIX
translit(`GNUs not Unix', `A-Z', `z-a')
@result{}tmfs not fnix
+translit(`abcdef', `aabdef', `bcged')
address@hidden
@end example
The first example deletes all uppercase letters, the second converts
lowercase to uppercase, and the third `mirrors' all uppercase letters,
while converting them to lowercase. The two first cases are by far the
-most common.
+most common. The final example shows that @samp{a} is mapped to
address@hidden, not @samp{c}; the resulting @samp{b} is not further remapped
+to @samp{g}; the @samp{d} and @samp{e} are swapped, and the @samp{f} is
+discarded.
Omitting @var{chars} evokes a warning, but still produces output.
Index: m4/gnulib-cache.m4
===================================================================
RCS file: /sources/m4/m4/m4/Attic/gnulib-cache.m4,v
retrieving revision 1.1.2.17
diff -u -p -r1.1.2.17 gnulib-cache.m4
--- m4/gnulib-cache.m4 17 Oct 2006 16:13:03 -0000 1.1.2.17
+++ m4/gnulib-cache.m4 31 Oct 2006 13:18:19 -0000
@@ -15,11 +15,11 @@
# Specification in the form of a command-line invocation:
-# gnulib-tool --import --dir=. --lib=libm4 --source-base=lib --m4-base=m4
--doc-base=doc --aux-dir=. --no-libtool --macro-prefix=M4 binary-io clean-temp
cloexec close-stream closeout config-h error fdl fopen-safer free gendocs
getopt gnupload mkstemp obstack regex stdlib-safer strtol unlocked-io verror
xalloc xvasprintf
+# gnulib-tool --import --dir=. --lib=libm4 --source-base=lib --m4-base=m4
--doc-base=doc --aux-dir=. --no-libtool --macro-prefix=M4 binary-io clean-temp
cloexec close-stream closeout config-h error fdl fopen-safer free gendocs
getopt gnupload mkstemp obstack regex stdlib-safer strstr strtol unlocked-io
verror xalloc xvasprintf
# Specification in the form of a few gnulib-tool.m4 macro invocations:
gl_LOCAL_DIR([])
-gl_MODULES([binary-io clean-temp cloexec close-stream closeout config-h error
fdl fopen-safer free gendocs getopt gnupload mkstemp obstack regex stdlib-safer
strtol unlocked-io verror xalloc xvasprintf])
+gl_MODULES([binary-io clean-temp cloexec close-stream closeout config-h error
fdl fopen-safer free gendocs getopt gnupload mkstemp obstack regex stdlib-safer
strstr strtol unlocked-io verror xalloc xvasprintf])
gl_AVOID([])
gl_SOURCE_BASE([lib])
gl_M4_BASE([m4])
Index: src/builtin.c
===================================================================
RCS file: /sources/m4/m4/src/Attic/builtin.c,v
retrieving revision 1.1.1.1.2.47
diff -u -p -r1.1.1.1.2.47 builtin.c
--- src/builtin.c 29 Oct 2006 15:22:42 -0000 1.1.1.1.2.47
+++ src/builtin.c 31 Oct 2006 13:18:19 -0000
@@ -27,6 +27,7 @@
extern FILE *popen ();
#include "regex.h"
+#include "strstr.h"
#if HAVE_SYS_WAIT_H
# include <sys/wait.h>
@@ -1575,8 +1576,9 @@ m4_len (struct obstack *obs, int argc, t
static void
m4_index (struct obstack *obs, int argc, token_data **argv)
{
- const char *cp, *last;
- int l1, l2, retval;
+ const char *haystack;
+ const char *result;
+ int retval;
if (bad_argc (argv[0], argc, 3, 3))
{
@@ -1586,17 +1588,9 @@ m4_index (struct obstack *obs, int argc,
return;
}
- l1 = strlen (ARG (1));
- l2 = strlen (ARG (2));
-
- last = ARG (1) + l1 - l2;
-
- for (cp = ARG (1); cp <= last; cp++)
- {
- if (strncmp (cp, ARG (2), l2) == 0)
- break;
- }
- retval = (cp <= last) ? cp - ARG (1) : -1;
+ haystack = ARG (1);
+ result = strstr (haystack, ARG (2));
+ retval = result ? result - haystack : -1;
shipout_int (obs, retval);
}
@@ -1693,9 +1687,11 @@ expand_ranges (const char *s, struct obs
static void
m4_translit (struct obstack *obs, int argc, token_data **argv)
{
- register const char *data, *tmp;
- const char *from, *to;
- int tolen;
+ const unsigned char *data;
+ const unsigned char *from;
+ const unsigned char *to;
+ char map[256] = {0};
+ char found[256] = {0};
if (bad_argc (argv[0], argc, 3, 4))
{
@@ -1713,33 +1709,37 @@ m4_translit (struct obstack *obs, int ar
return;
}
- if (argc >= 4)
+ to = ARG (3);
+ if (strchr (to, '-') != NULL)
+ {
+ to = expand_ranges (to, obs);
+ if (to == NULL)
+ return;
+ }
+
+ /* Calling strchr(from) for each character in data is quadratic,
+ since both strings can be arbitrarily long. Instead, create a
+ from-to mapping in one pass of data, then use that map in one
+ pass of from, for linear behavior. Traditional behavior is that
+ only the first instance of a character in from is consulted,
+ hence the found map. */
+ for ( ; *from; from++)
{
- to = ARG (3);
- if (strchr (to, '-') != NULL)
+ if (! found[*from])
{
- to = expand_ranges (to, obs);
- if (to == NULL)
- return;
+ found[*from] = 1;
+ map[*from] = *to;
}
+ if (*to != '\0')
+ to++;
}
- else
- to = "";
-
- tolen = strlen (to);
for (data = ARG (1); *data; data++)
{
- tmp = strchr (from, *data);
- if (tmp == NULL)
- {
- obstack_1grow (obs, *data);
- }
- else
- {
- if (tmp - from < tolen)
- obstack_1grow (obs, *(to + (tmp - from)));
- }
+ if (! found[*data])
+ obstack_1grow (obs, *data);
+ else if (map[*data])
+ obstack_1grow (obs, map[*data]);
}
}
Index: ltdl/m4/gnulib-cache.m4
===================================================================
RCS file: /sources/m4/m4/ltdl/m4/gnulib-cache.m4,v
retrieving revision 1.18
diff -u -p -r1.18 gnulib-cache.m4
--- ltdl/m4/gnulib-cache.m4 28 Oct 2006 19:37:47 -0000 1.18
+++ ltdl/m4/gnulib-cache.m4 31 Oct 2006 14:04:45 -0000
@@ -15,11 +15,11 @@
# Specification in the form of a command-line invocation:
-# gnulib-tool --import --dir=. --lib=libgnu --source-base=gnu
--m4-base=ltdl/m4 --doc-base=doc --aux-dir=ltdl/config --libtool
--macro-prefix=M4 assert avltree-list binary-io clean-temp cloexec close-stream
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic
stdbool stdlib-safer strnlen strtol tempname unlocked-io verror xalloc
xalloc-die xstrndup xstrtol xvasprintf
+# gnulib-tool --import --dir=. --lib=libgnu --source-base=gnu
--m4-base=ltdl/m4 --doc-base=doc --aux-dir=ltdl/config --libtool
--macro-prefix=M4 assert avltree-list binary-io clean-temp cloexec close-stream
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic
stdbool stdlib-safer strnlen strstr strtol tempname unlocked-io verror xalloc
xalloc-die xstrndup xstrtol xvasprintf
# Specification in the form of a few gnulib-tool.m4 macro invocations:
gl_LOCAL_DIR([])
-gl_MODULES([assert avltree-list binary-io clean-temp cloexec close-stream
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic
stdbool stdlib-safer strnlen strtol tempname unlocked-io verror xalloc
xalloc-die xstrndup xstrtol xvasprintf])
+gl_MODULES([assert avltree-list binary-io clean-temp cloexec close-stream
closeout config-h configmake dirname error exit fdl filenamecat fopen-safer
free gendocs gettext gnupload mkstemp obstack progname regex regexprops-generic
stdbool stdlib-safer strnlen strstr strtol tempname unlocked-io verror xalloc
xalloc-die xstrndup xstrtol xvasprintf])
gl_AVOID([])
gl_SOURCE_BASE([gnu])
gl_M4_BASE([ltdl/m4])
Index: doc/m4.texinfo
===================================================================
RCS file: /sources/m4/m4/doc/m4.texinfo,v
retrieving revision 1.74
diff -u -p -r1.74 m4.texinfo
--- doc/m4.texinfo 27 Oct 2006 17:03:51 -0000 1.74
+++ doc/m4.texinfo 31 Oct 2006 14:04:48 -0000
@@ -4853,29 +4853,36 @@ The builtin macro @code{substr} is recog
@cindex translating characters
@cindex characters, translating
+Character translation is done with @code{translit}:
+
@deffn {Builtin (m4)} translit (@var{string}, @var{chars}, @var{replacement})
-Character translation is done with @code{translit}, which expands to
address@hidden, with each character that occurs in @var{chars} translated
-into the character from @var{replacement} with the same index.
+Expands to @var{string}, with each character that occurs in
address@hidden translated into the character from @var{replacement} with
+the same index.
If @var{replacement} is shorter than @var{chars}, the excess characters
-are deleted from the expansion. If @var{replacement} is omitted, all
-characters in @var{string}, that are present in @var{chars} are deleted
-from the expansion.
+of @var{chars} are deleted from the expansion; if @var{chars} is
+shorter, the excess characters in @var{replacement} are silently
+ignored. If @var{replacement} is omitted, all characters in
address@hidden that are present in @var{chars} are deleted from the
+expansion. If a character appears more than once in @var{chars}, only
+the first instance is used in making the translation. Only a single
+translation pass is made, even if characters in @var{replacement} also
+appear in @var{chars}.
-Both @var{chars} and @var{replacement} can contain character-ranges,
+As a @acronym{GNU} extension, both @var{chars} and @var{replacement} can
+contain character-ranges,
e.g., @samp{a-z} (meaning all lowercase letters) or @samp{0-9} (meaning
all digits). To include a dash @samp{-} in @var{chars} or
@var{replacement}, place it first or last.
-The builtin macro @code{translit} is recognized only when given
-arguments.
address@hidden deffn
-
-It is not an error for the last character in the range to be `smaller'
+It is not an error for the last character in the range to be `larger'
than the first. In that case, the range runs backwards, i.e.,
@samp{9-0} means the string @samp{9876543210}.
+The macro @code{translit} is recognized only with parameters.
address@hidden deffn
+
@example
translit(`GNUs not Unix', `A-Z')
@result{}s not nix
@@ -4883,12 +4890,25 @@ translit(`GNUs not Unix', `a-z', `A-Z')
@result{}GNUS NOT UNIX
translit(`GNUs not Unix', `A-Z', `z-a')
@result{}tmfs not fnix
+translit(`abcdef', `aabdef', `bcged')
address@hidden
@end example
The first example deletes all uppercase letters, the second converts
lowercase to uppercase, and the third `mirrors' all uppercase letters,
while converting them to lowercase. The two first cases are by far the
-most common.
+most common. The final example shows that @samp{a} is mapped to
address@hidden, not @samp{c}; the resulting @samp{b} is not further remapped
+to @samp{g}; the @samp{d} and @samp{e} are swapped, and the @samp{f} is
+discarded.
+
+Omitting @var{chars} evokes a warning, but still produces output.
+
address@hidden
+translit(`abc')
address@hidden:stdin:1: Warning: translit: too few arguments: 1 < 2
address@hidden
address@hidden example
@node Patsubst
@section Substituting text by regular expression
Index: modules/m4.c
===================================================================
RCS file: /sources/m4/m4/modules/m4.c,v
retrieving revision 1.89
diff -u -p -r1.89 m4.c
--- modules/m4.c 31 Oct 2006 02:05:37 -0000 1.89
+++ modules/m4.c 31 Oct 2006 14:04:48 -0000
@@ -23,6 +23,7 @@
#include <errno.h>
#include "stdlib--.h"
+#include "strstr.h"
#include "tempname.h"
#include "unistd--.h"
@@ -550,7 +551,7 @@ M4BUILTIN_HANDLER (divert)
{
int i = 0;
- if (argc == 2 && !m4_numeric_arg (context, argc, argv, 1, &i))
+ if (argc >= 2 && !m4_numeric_arg (context, argc, argv, 1, &i))
return;
m4_make_diversion (context, i);
@@ -878,20 +879,9 @@ M4BUILTIN_HANDLER (len)
argument. */
M4BUILTIN_HANDLER (index)
{
- const char *cp, *last;
- int l1, l2, retval;
-
- l1 = strlen (M4ARG (1));
- l2 = strlen (M4ARG (2));
-
- last = M4ARG (1) + l1 - l2;
-
- for (cp = M4ARG (1); cp <= last; cp++)
- {
- if (strncmp (cp, M4ARG (2), l2) == 0)
- break;
- }
- retval = (cp <= last) ? cp - M4ARG (1) : -1;
+ const char *haystack = M4ARG (1);
+ const char *result = strstr (haystack, M4ARG (2));
+ int retval = result ? result - haystack : -1;
m4_shipout_int (obs, retval);
}
@@ -908,7 +898,7 @@ M4BUILTIN_HANDLER (substr)
if (!m4_numeric_arg (context, argc, argv, 2, &start))
return;
- if (argc == 4 && !m4_numeric_arg (context, argc, argv, 3, &length))
+ if (argc >= 4 && !m4_numeric_arg (context, argc, argv, 3, &length))
return;
if (start < 0 || length <= 0 || start >= avail)
@@ -969,45 +959,49 @@ m4_expand_ranges (const char *s, m4_obst
deleted from the first (pueh) */
M4BUILTIN_HANDLER (translit)
{
- register const char *data, *tmp;
- const char *from, *to;
- int tolen;
+ const unsigned char *data;
+ const unsigned char *from;
+ const unsigned char *to;
+ char map[256] = {0};
+ char found[256] = {0};
from = M4ARG (2);
if (strchr (from, '-') != NULL)
{
from = m4_expand_ranges (from, obs);
- if (from == NULL)
- return;
+ assert (from);
+ }
+
+ to = M4ARG (3);
+ if (strchr (to, '-') != NULL)
+ {
+ to = m4_expand_ranges (to, obs);
+ assert (to);
}
- if (argc == 4)
+ /* Calling strchr(from) for each character in data is quadratic,
+ since both strings can be arbitrarily long. Instead, create a
+ from-to mapping in one pass of data, then use that map in one
+ pass of from, for linear behavior. Traditional behavior is that
+ only the first instance of a character in from is consulted,
+ hence the found map. */
+ for ( ; *from; from++)
{
- to = M4ARG (3);
- if (strchr (to, '-') != NULL)
+ if (! found[*from])
{
- to = m4_expand_ranges (to, obs);
- if (to == NULL)
- return;
+ found[*from] = 1;
+ map[*from] = *to;
}
+ if (*to != '\0')
+ to++;
}
- else
- to = "";
-
- tolen = strlen (to);
for (data = M4ARG (1); *data; data++)
{
- tmp = strchr (from, *data);
- if (tmp == NULL)
- {
- obstack_1grow (obs, *data);
- }
- else
- {
- if (tmp - from < tolen)
- obstack_1grow (obs, *(to + (tmp - from)));
- }
+ if (! found[*data])
+ obstack_1grow (obs, *data);
+ else if (map[*data])
+ obstack_1grow (obs, map[*data]);
}
}
Index: tests/builtins.at
===================================================================
RCS file: /sources/m4/m4/tests/builtins.at,v
retrieving revision 1.29
diff -u -p -r1.29 builtins.at
--- tests/builtins.at 27 Oct 2006 15:16:31 -0000 1.29
+++ tests/builtins.at 31 Oct 2006 14:04:48 -0000
@@ -862,6 +862,16 @@ z
tmfs not fnix
]])
+dnl This used to be quadratic, taking more than 25 million comparisons,
+dnl but should now operate in linear time with only several thousand checks.
+AT_DATA([[in]],
+[[translit(`a]m4_for([i],[1],[5000],[],[[b]])[',
+ `]m4_for([i],[1],[5000],[],[[a]])[b',
+ `c]m4_for([i],[1],[5000],[],[[d]])[')
+]])
+AT_CHECK_M4([in], [0], [[c]m4_for([i],[1],[5000],[],[[d]])
+])
+
AT_CLEANUP
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- translit speedup,
Eric Blake <=