How Is \B Supposed to Work in Regexps?
From: Neil R. Ormos
Subject: How Is \B Supposed to Work in Regexps?
Date: Tue, 6 Sep 2022 15:19:47 -0500 (CDT)
The \B regexp operator doesn't appear to work as described in the manual.
In the manual, Section 3.7, "gawk-Specific Regexp Operators", says that \B matches
| the empty string that occurs between two
| word-constituent characters. For example,
| /\Brat\B/ matches 'crate', but it does not match
| 'dirty rat'. '\B' is essentially the opposite of
| '\y'.
\B seems to match even strings that contain no word-constituent characters.
The little test program in the examples below tries to match() a one-element
regexp, either \w, \y, or \B, against various test strings in $0. The first
print displays the value from match(), followed by $0 sandwiched between two
"|" characters. The second print places a caret ("^") under $0, as printed
above, at the position of the regexp identified by match().
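The same test can be written as a small shell function, which may be easier to read than the one-liners. This is only a sketch: the function name show_match and the AWK override variable are illustrative and not part of the original transcript, and the regexp is passed through the environment so that the backslash reaches match() untouched.

```shell
# show_match REGEX STRING
# Print match()'s return value and STRING between "|" marks, then a caret
# under the reported position. AWK is a hypothetical override for the
# interpreter name; it defaults to gawk, which \w, \y, and \B require.
show_match() {
    printf '%s\n' "$2" | RE=$1 "${AWK:-gawk}" '{
        # ENVIRON avoids the escape processing that -v applies to "\B".
        p = match($0, ENVIRON["RE"])
        print p " |" $0 "|"
        # Same offset as the one-liners: width of p, then " |", then p-1.
        print sprintf("%" (length(p) + 2 + p - 1) "s", "") "^"
    }'
}
```

With gawk installed, show_match '\B' 'a/a/a/' should correspond to the command on line 49 of the transcript.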
Output lines 9, 12, and 15 show that " " and "/" are not word-constituent
characters, while "a" is a word-constituent.
In output lines 19, 25, and 31, searching for \y, the empty string at the
beginning or end of a word, the results are as expected: match() returns the
position of the first "a" in $0. Likewise, in output lines 22 and 28, there
are no word-constituent characters in the corresponding $0, and match() returns
0.
Lines 34-51 involve matching \B, and I don't understand those results.
For output line 35, the input string "aaa" starts with a run of
word-constituent characters. The result is 2, as expected.
For output line 47, the input string "a" is exactly one word-constituent
character. The result is 0, as expected, because there is no "empty string
between two word-constituent characters".
For output line 41, using input string "   aaa", the run of word-constituent
characters begins in position 4, yet match() returns 1. I would have expected
5.
For output lines 38 and 44, the input strings have no word-constituent
characters, yet match() again returns 1. I would have expected 0.
For output line 50, input string "a/a/a/", there are no pairs of adjacent
word-constituent characters, and therefore, there should be no "empty string
between two word-constituent characters". Here, match() returns 7, identifying
a position outside the six-character input string. Again, I would have
expected 0.
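One possible reading takes the manual's last sentence literally: \B matches at any position where \y would not, with the outside of the string counted as a non-word character. The sketch below models that reading in plain awk without using \B at all. The function name first_nonboundary, the AWK override variable, and the ASCII-only definition of a word-constituent character are assumptions made for illustration; the sketch says nothing about how gawk or its regexp library is implemented.

```shell
# first_nonboundary STRING
# Hypothetical model of "\B is the opposite of \y": report the first 1-based
# position whose left and right neighbours are both word characters or both
# not, counting the outside of the string as "not". Report 0 when every
# position is a word boundary.
first_nonboundary() {
    printf '%s\n' "$1" | "${AWK:-awk}" '
    # Assumption: word-constituent means ASCII letter, digit, or underscore.
    function isword(c) { return c ~ /^[A-Za-z0-9_]$/ }
    {
        n = length($0)
        for (i = 1; i <= n + 1; i++) {
            left  = (i > 1)  ? isword(substr($0, i - 1, 1)) : 0
            right = (i <= n) ? isword(substr($0, i, 1))     : 0
            if (left == right) { print i; next }
        }
        print 0
    }'
}
```

Under this model the inputs "aaa", "///", "a", and "a/a/a/" give 2, 1, 0, and 7, the same values match() returned for \B in the transcript, and any input that starts with a non-word character gives 1.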
I'm not sure whether these results show a bug in Gawk (or the regexp library or
libraries it uses), a bug in the manual, my error in interpretation, or some
other PBCAK error. Any insights?
(The gawk executable referenced in the examples was built from the most recent
release, but I think I get the same results from an ancient gawk.)
############################################################
############################################################
 8 echo " " | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
 9 0 | |
10   ^
11 echo "a" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
12 1 |a|
13    ^
14 echo "/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
15 0 |/|
16   ^
17
18 echo "aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
19 1 |aaa|
20    ^
21 echo "///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
22 0 |///|
23   ^
24 echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
25 4 |   aaa|
26       ^
27 echo " ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
28 0 | ///|
29   ^
30 echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
31 1 |a/a/a/|
32    ^
33
34 echo "aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
35 2 |aaa|
36     ^
37 echo "///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
38 1 |///|
39    ^
40 echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
41 1 |   aaa|
42    ^
43 echo " ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
44 1 | ///|
45    ^
46 echo "a" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
47 0 |a|
48   ^
49 echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
50 7 |a/a/a/|
51          ^
############################################################