help-gnu-emacs

Re: Strange whitespace remains after emoji regexp replace


From: Eli Zaretskii
Subject: Re: Strange whitespace remains after emoji regexp replace
Date: Wed, 25 Dec 2024 14:51:24 +0200

> Date: Wed, 25 Dec 2024 14:38:14 +0300
> From: Jean Louis <bugs@gnu.support>
> 
> There is this function:
> 
> (defun wrs-search-clean-entry (entry)
>   "Clean and normalize an ENTRY string.
> 
> Prepare it for easier searching"
>   (let* ((entry (replace-regexp-in-string
>                  (rx (one-or-more (or (not alnum) "\n" blank))) " " entry))
>        (entry (replace-regexp-in-string (rx (one-or-more " ")) " " entry))
>        (string-trim entry))
>     entry))
> 
> And now this emoji here probably creates some strange wide white
> space. I do not know if anybody can see that wide whitespace; it is
> invisible, and it comes right after the first quote in the result:
> 
> (wrs-search-clean-entry "☺️ )(**(&&^%^$##@!))") ➜ " ️ "
> 
> It is in the above position, same as X in the below position:
> (wrs-search-clean-entry "☺️ )(**(&&^%^$##@!))") ➜ "X "
> 
> M-x describe-char
> 
> gives me:
>  
>              position: 800 of 923 (87%), column: 50
>             character: SPC (displayed as SPC) (codepoint 32, #o40, #x20)
>               charset: ascii (ASCII (ISO646 IRV))
> code point in charset: 0x20
>                script: latin
>                syntax:        which means: whitespace
>              category: .:Base, a:ASCII, l:Latin
>              to input: type "C-x 8 RET 20" or "C-x 8 RET SPACE"
>           buffer code: #x20
>             file code: not encodable by coding system nil
>               display: composed to form " ️" (see below)
> 
> Composed with the following character(s) "️" using this font:
>   ftcrhb:-GOOG-Noto Color Emoji-regular-normal-normal-*-23-*-*-*-m-0-iso10646-1
> by these glyphs:
>   [0 1 32 3 29 0 0 0 0 nil]
>   [0 1 65039 3 29 0 0 0 0 [0 0 0]]
> with these character(s):
>   ️ (#xfe0f) VARIATION SELECTOR-16
> 
> Character code properties: customize what to show
>   name: SPACE
>   general-category: Zs (Separator, Space)
>   decomposition: (32) (' ')
> 
> There are text properties here:
>   fontified            t
> 
> The difference from a normal space is that it is composed with
> (#xfe0f) VARIATION SELECTOR-16.
> 
> But I don't want it. I want to clean EVERYTHING that is not
> alpha-numeric from the string.
> 
> How do I make sure of it?

Remove the VS-16 character as well, how else?
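
One way to do that, as a minimal sketch: delete U+FE0F before the
alnum-based replacement runs. The selector presumably survives the
quoted regexp because Emacs's `alnum' class treats combining marks
(general category Mn, to which VS-16 belongs) as alphabetic; that
explanation is an assumption, not something stated in this message.
The function name `my-search-clean-entry' is hypothetical.

```elisp
(require 'subr-x) ; provides `string-trim' on older Emacs versions

(defun my-search-clean-entry (entry)
  "Clean ENTRY for searching, dropping VS-16 (U+FE0F) first."
  (let* (;; Delete VARIATION SELECTOR-16 explicitly, since the
         ;; (not alnum) pattern below does not remove it.
         (entry (replace-regexp-in-string "\uFE0F" "" entry))
         ;; Collapse every run of non-alphanumeric characters
         ;; into a single space.
         (entry (replace-regexp-in-string
                 (rx (one-or-more (or (not alnum) "\n" blank)))
                 " " entry)))
    ;; Trim in the body so the trimmed string is what gets returned.
    (string-trim entry)))

(my-search-clean-entry "hello \u263A\uFE0F world!!")
;; => "hello world"
```

Note that the sketch calls `string-trim' in the body of the `let*'.
In the quoted function the `(string-trim entry)' form sits in the
binding list, where it only binds a variable named `string-trim', so
the result is never trimmed.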


