help-gawk

How Is \B Supposed to Work in Regexps?


From: Neil R. Ormos
Subject: How Is \B Supposed to Work in Regexps?
Date: Tue, 6 Sep 2022 15:19:47 -0500 (CDT)

The \B regexp operator doesn't appear to work as described in the manual.

In manual Section "3.7 gawk-Specific Regexp Operators", \B is said to match

| the empty string that occurs between two
| word-constituent characters. For example,
| /\Brat\B/ matches 'crate', but it does not match
| 'dirty rat'. '\B' is essentially the opposite of
| '\y'.

\B seems to match even strings that contain no word-constituent characters.

The little test program in the examples below calls match() with a one-element
regexp (\w, \y, or \B) against various test strings in $0.  The first print
displays the value returned by match(), followed by $0 sandwiched between two
"|" characters.  The second print places a caret ("^") under $0, as printed
above, at the position reported by match(); when match() returns 0, the caret
falls under the opening "|".
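
For readability, the same test can be sketched as a small shell function. This is only a restatement of the one-liner used in the listing, under a few assumptions: the function name probe and the AWK override variable are hypothetical, the regexp is passed through the environment (ENVIRON) rather than written as a regexp constant, and the padding is built with a loop instead of substr().

```shell
# probe REGEXP STRING: readable form of the one-liner used in the listing.
# AWK is a hypothetical override; it defaults to "gawk" because \w, \y and \B
# are gawk-specific operators.
probe() {
    printf '%s\n' "$2" | RE="$1" "${AWK:-gawk}" '{
        # The regexp arrives through the environment so that backslashes
        # reach match() unchanged (no -v escape processing).
        p = match($0, ENVIRON["RE"])
        print p " |" $0 "|"
        # Spaces before the caret: the width of p, one space, one "|",
        # then p - 1 characters of $0. When p is 0 the caret lands
        # under the opening "|".
        n = length(p) + 2 + p - 1
        pad = ""
        for (i = 0; i < n; i++) pad = pad " "
        print pad "^"
    }'
}
```

With the gawk used for the listing, a call such as probe '\B' 'a/a/a/' should correspond to listing lines 49-51.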

Output lines 9, 12, and 15 show that " " and "/" are not word-constituent 
characters, while "a" is a word-constituent.

In output lines 19, 25, and 31, which search for \y (the empty string at the
beginning or end of a word), the results are as expected: match() returns the
position of the first "a" in $0.  Likewise, in output lines 22 and 28, the
corresponding $0 contains no word-constituent characters, and match() returns
0.

Lines 34-51 involve matching \B, and I don't understand those results.

For output line 35, the input string "aaa" is a run of three word-constituent
characters.  The result is 2, as expected: the first empty string between two
word-constituent characters lies just before the second "a".

For output line 47, the input string "a" is exactly one word-constituent 
character.  The result is 0, as expected, because there is no "empty string 
between two word-constituent characters".

For output line 41, using input string "   aaa", the run of word-constituent 
characters begins in position 4, yet match() returns 1.  I would have expected 
5.

For output lines 38 and 44, the input strings have no word-constituent 
characters, yet match() again returns 1.  I would have expected 0.

For output line 50, input string "a/a/a/", there are no pairs of adjacent 
word-constituent characters, and therefore, there should be no "empty string 
between two word-constituent characters".  Here, match() returns 7, identifying 
a position outside the six-character input string.  Again, I would have 
expected 0.
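
To make the expectations above concrete, here is a minimal sketch in portable awk of two readings of the manual's description. Both are assumptions for illustration, not gawk's regexp code: "literal" takes "between two word-constituent characters" at face value, while "opposite" takes the manual's closing remark that \B is "essentially the opposite of \y" to mean any position that is not the beginning or end of a word. The helper name b_position and the ASCII-only notion of a word-constituent character are hypothetical.

```shell
# Two hypothetical readings of \B, written in portable awk (no gawk
# extensions), so neither is gawk's actual implementation.
# Assumption: word-constituent means an ASCII letter, digit, or underscore.
b_position() {   # usage: b_position literal|opposite STRING
    printf '%s\n' "$2" | awk -v mode="$1" '
    function isword(c) { return c ~ /^[A-Za-z0-9_]$/ }
    {
        n = length($0)
        pos = 0
        # p names the empty string just before character p;
        # p = n + 1 is the empty string after the last character.
        for (p = 1; p <= n + 1 && !pos; p++) {
            # Positions outside the string count as non-word.
            before = (p > 1)  && isword(substr($0, p - 1, 1))
            after  = (p <= n) && isword(substr($0, p, 1))
            if (mode == "literal")
                hit = before && after      # between two word characters
            else
                hit = (before == after)    # not at the edge of a word
            if (hit) pos = p
        }
        print pos    # first matching position, or 0 if there is none
    }'
}
```

For the six \B inputs in the listing ("aaa", "///", "   aaa", "   ///", "a", "a/a/a/"), the literal reading gives 2, 0, 5, 0, 0, 0, which are the values expected above, while the opposite reading gives 2, 1, 1, 1, 0, 7, which are the values match() returned.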

I'm not sure whether these results show a bug in Gawk (or the regexp library or 
libraries it uses), a bug in the manual, my error in interpretation, or some 
other PBCAK error.  Any insights?

(The gawk executable referenced in the examples was built from the most recent 
release, but I think I get the same results from an ancient gawk.)

############################################################
############################################################

 8      echo " "      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
 9      0 | |
10        ^
11      echo "a"      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
12      1 |a|
13         ^
14      echo "/"      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
15      0 |/|
16        ^
17
18      echo "aaa"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
19      1 |aaa|
20         ^
21      echo "///"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
22      0 |///|
23        ^
24      echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
25      4 |   aaa|
26            ^
27      echo "   ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
28      0 |   ///|
29        ^
30      echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
31      1 |a/a/a/|
32         ^
33
34      echo "aaa"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
35      2 |aaa|
36          ^
37      echo "///"    | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
38      1 |///|
39         ^
40      echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
41      1 |   aaa|
42         ^
43      echo "   ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
44      1 |   ///|
45         ^
46      echo "a"      | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
47      0 |a|
48        ^
49      echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
50      7 |a/a/a/|
51               ^

############################################################


