Re: How Is \B Supposed to Work in Regexps?
From: Wolfgang Laun
Subject: Re: How Is \B Supposed to Work in Regexps?
Date: Wed, 7 Sep 2022 17:18:54 +0200
\B matches at any position that is not a word boundary, including a
position between two non-word characters.
It might become clearer when you run the slightly modified matching
operation:
echo "///" | awk '{h = $0; p = gsub( /\B/, "!" ); print p " >" h "< >" $0 "<";}'
4 >///< >!/!/!/!<
Where there is no word, there cannot be a word boundary.
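The rule can also be checked without \B at all. The following is only a
sketch in portable awk, not how gawk or its regexp library implements the
operator: it assumes that "word-constituent" means letters, digits, and
underscore, and it treats the start and end of the line as non-word. The
names nonboundaries and isword are made up for this illustration. A position
is reported when the characters on both sides are of the same kind, which is
exactly "not at a word boundary".

```shell
# Hypothetical helper (not part of gawk): print every 1-based position in the
# argument that is NOT a word boundary, i.e. where \B is expected to match.
nonboundaries() {
  printf '%s\n' "$1" | awk '
    # Assumption: word-constituent = letter, digit, or underscore.
    function isword(c) { return c ~ /^[A-Za-z0-9_]$/ }
    {
      out = ""
      n = length($0)
      # Position i is the empty string just before character i;
      # position n + 1 is the end of the line.
      for (i = 1; i <= n + 1; i++) {
        before = (i > 1)  ? isword(substr($0, i - 1, 1)) : 0
        after  = (i <= n) ? isword(substr($0, i, 1))     : 0
        # Same kind on both sides: no boundary here.
        if (before == after)
          out = out (out == "" ? "" : " ") i
      }
      print (out == "" ? "none" : out)
    }'
}

nonboundaries "///"      # prints: 1 2 3 4
nonboundaries "a/a/a/"   # prints: 7
nonboundaries "a"        # prints: none
nonboundaries "aaa"      # prints: 2 3
```

The first position printed for each string agrees with the match() results
quoted below (1, 7, 0, and 2), and the four positions for "///" agree with
the gsub() count above.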
Regards
Wolfgang
On Tue, 6 Sept 2022 at 22:35, Neil R. Ormos <ormos-gnulists17@ormos.org>
wrote:
> The \B regexp operator doesn't appear to work as described in the manual.
>
> In manual Section "3.7 gawk-Specific Regexp Operators", \B is said to match
>
> | the empty string that occurs between two
> | word-constituent characters. For example,
> | /\Brat\B/ matches 'crate', but it does not match
> | 'dirty rat'. '\B' is essentially the opposite of
> | '\y'.
>
> \B seems to match even strings that contain no word-constituent characters.
>
> The little test program in the examples below tries to match() a
> one-element regexp, either \w, \y, or \B, against various test strings in
> $0. The first print displays the value from match(), followed by $0
> sandwiched between two "|" characters. The second print places a caret
> ("^") under $0, as printed above, at the position of the regexp identified
> by match().
>
> Output lines 9, 12, and 15 show that " " and "/" are not word-constituent
> characters, while "a" is a word-constituent.
>
> In output lines 19, 25, and 31, searching for \y, the empty string at the
> beginning or end of a word, the results are as expected: match() returns
> the position of the first "a" in $0. Likewise, in output lines 22 and 28,
> there are no word-constituent characters in the corresponding $0, and
> match() returns 0.
>
> Lines 34-51 involve matching \B, and I don't understand those results.
>
> For output line 35, the input string "aaa" starts with a run of
> word-constituent characters. The result is 2, as expected.
>
> For output line 47, the input string "a" is exactly one word-constituent
> character. The result is 0, as expected, because there is no "empty string
> between two word-constituent characters".
>
> For output line 41, using input string "   aaa", the run of
> word-constituent characters begins in position 4, yet match() returns 1. I
> would have expected 5.
>
> For output lines 38 and 44, the input strings have no word-constituent
> characters, yet match() again returns 1. I would have expected 0.
>
> For output line 50, input string "a/a/a/", there are no pairs of adjacent
> word-constituent characters, and therefore, there should be no "empty
> string between two word-constituent characters". Here, match() returns 7,
> identifying a position outside the six-character input string. Again, I
> would have expected 0.
>
> I'm not sure whether these results show a bug in Gawk (or the regexp
> library or libraries it uses), a bug in the manual, my error in
> interpretation, or some other PEBCAK error. Any insights?
>
> (The gawk executable referenced in the examples was built from the most
> recent release, but I think I get the same results from an ancient gawk.)
>
> ############################################################
> ############################################################
>
>  8  echo " " | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
>  9  0 | |
> 10    ^
> 11  echo "a" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 12  1 |a|
> 13     ^
> 14  echo "/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\w/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 15  0 |/|
> 16    ^
> 17
> 18  echo "aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 19  1 |aaa|
> 20     ^
> 21  echo "///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 22  0 |///|
> 23    ^
> 24  echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 25  4 |   aaa|
> 26        ^
> 27  echo " ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 28  0 | ///|
> 29    ^
> 30  echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\y/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 31  1 |a/a/a/|
> 32     ^
> 33
> 34  echo "aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 35  2 |aaa|
> 36      ^
> 37  echo "///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 38  1 |///|
> 39     ^
> 40  echo "   aaa" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 41  1 |   aaa|
> 42     ^
> 43  echo " ///" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 44  1 | ///|
> 45     ^
> 46  echo "a" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 47  0 |a|
> 48    ^
> 49  echo "a/a/a/" | ~/.local/bin/gawk-5.2.0 '{p=match($0, /\B/); print p " |" $0 "|"; s=" "; for (i=1; i<=8; i++) s=s s; print substr(s, 1, length(p)+2+p-1) "^";}'
> 50  7 |a/a/a/|
> 51           ^
>
> ############################################################
>
>
--
Wolfgang Laun